[00:40:45] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:04:01] (PS1) Dereckson: Update Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/337345 (https://phabricator.wikimedia.org/T101634) [01:09:45] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [01:15:05] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures [01:20:05] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 70 seconds ago with 0 failures [01:48:05] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1815.535171 Seconds [01:48:06] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1815.541351 Seconds [01:48:06] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1819.11785 Seconds [01:49:05] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 28.533357 Seconds [01:49:08] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 28.539028 Seconds [01:49:08] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 32.23709 Seconds [02:33:25] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 13m 18s) [02:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:43] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Feb 13 02:38:43 UTC 2017 (duration 5m 18s) [02:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:55] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1800.813264 Seconds [03:05:55] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [03:20:55] PROBLEM - puppet last run on db1071 
is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:23:55] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 710.70 seconds [03:28:55] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 21.79 seconds [03:48:55] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [03:52:06] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:15:45] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=453.60 Read Requests/Sec=547.60 Write Requests/Sec=401.70 KBytes Read/Sec=35108.80 KBytes_Written/Sec=2677.60 [04:21:05] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [04:24:45] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.20 Read Requests/Sec=53.40 Write Requests/Sec=28.20 KBytes Read/Sec=959.60 KBytes_Written/Sec=241.60 [05:59:05] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:26:05] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:42:55] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:03:21] !log Compressing commonswiki tables on labsdb1010 and labsdb1011 - T153743 [07:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:27] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:11:55] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:38:37] (CR) Muehlenhoff: [C: 2] Update to 4.4.48 [debs/linux44] - https://gerrit.wikimedia.org/r/336989 (owner: Muehlenhoff) [07:44:15] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: /srv 15905 MB (3% inode=99%) [07:55:04] !log upgrade HHVM on remaining mw servers in eqiad [07:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:53] !log removed empty log files from elastic1022,1024,2001,1026,1040 to fix logrotate cronspam [08:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:29] there is a file in /srv on thumbor1002 with size 342G [08:34:18] and that's only the thumbnail :-) [08:34:44] lol [08:35:50] odd, I'll take a look elukey [08:36:14] godog: truncated to 300GB [08:36:15] RECOVERY - Disk space on thumbor1002 is OK: DISK OK [08:36:20] just to free some space [08:36:48] elukey: where was the file? [08:37:10] /srv/log/thumbor/thumbor.error.log.1 [08:37:24] it is still 300GB [08:37:44] maybe it is spamming some error tons of times [08:37:59] it is yeah [08:40:14] elukey: thanks I'll take a look at the error, looks like still too many open files [08:41:30] let me know if you need some help... [08:41:32] let me know if you need some help.... [08:41:33] let me know if you need some help... [08:41:35] let me know if you need some help... [08:41:40] :) [08:41:55] lolz [08:42:01] I see what you did there [08:44:29] is cp1052 down for some reason? 
[08:45:01] yes probably, 2 days ago [08:45:23] but no downtime [08:45:41] elukey: https://phabricator.wikimedia.org/T148891 [08:45:57] hello! [08:46:02] I just found it, super [08:46:03] I've tried to bring it back up on Friday to see how long it would last [08:46:07] not long clearly :) [08:46:10] hahaha [08:54:01] (PS1) Jdrewniak: Updating portal stats [mediawiki-config] - https://gerrit.wikimedia.org/r/337366 (https://phabricator.wikimedia.org/T128546) [08:58:00] (CR) Nikerabbit: "According to WikiApiary, this Wikipedia is the only wiki using this language, so this makes sense." [mediawiki-config] - https://gerrit.wikimedia.org/r/337192 (https://phabricator.wikimedia.org/T157846) (owner: MarcoAurelio) [09:00:45] !log Deploy alter table s3 officewiki and mediawikiwiki for echo_notification tables on eqiad - T136428 [09:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:51] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [09:04:17] (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/337368 (https://phabricator.wikimedia.org/T136428) [09:04:57] Operations, ops-eqiad, Traffic: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891#3021005 (ema) That didn't last long: ``` [Sat Feb 11 03:37:44 2017] bnx2x 0000:01:00.0 eth0: NIC Link is Down ``` Also worth mentioning, /var/log/mcelog contains multiple instances... [09:06:41] Operations, Analytics-Kanban, Traffic, Patch-For-Review: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#3021008 (ema) The patch is ready @Nuria, please let me know when it's OK to merge! 
[09:06:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/337368 (https://phabricator.wikimedia.org/T136428) (owner: Marostegui) [09:08:31] (Merged) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/337368 (https://phabricator.wikimedia.org/T136428) (owner: Marostegui) [09:08:40] (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/337368 (https://phabricator.wikimedia.org/T136428) (owner: Marostegui) [09:08:56] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hhvm] [09:08:59] (CR) Ema: VCL: Add support for WMF-Last-Access-Global analytics cookie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/336790 (https://phabricator.wikimedia.org/T138027) (owner: Ema) [09:09:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1077 - T136428 (duration: 00m 40s) [09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:53] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [09:10:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/337369 [09:11:55] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:13:55] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:15:14] Would one of you lovely operations people be able to take a quick look at https://phabricator.wikimedia.org/T157483 and confirm if I'm correct in assuming that it's probably why my key is getting refused? 
[09:16:02] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/337369 (owner: Marostegui) [09:16:46] Operations, Analytics-Kanban, netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3021058 (elukey) Remove elastic1001 -> 1016: ``` delete firewall family inet filter analytics-in4 term es from destination-address 10.64.0.108/32 delete firewall family inet filter an... [09:17:49] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/337369 (owner: Marostegui) [09:17:57] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/337369 (owner: Marostegui) [09:18:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1077 - T136428 (duration: 00m 40s) [09:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:50] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [09:20:31] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/337371 (https://phabricator.wikimedia.org/T136428) [09:22:16] PROBLEM - HHVM jobrunner on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:22:26] mmmm [09:22:35] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:22:37] I bet it is the locking issue [09:22:42] two of them? 
[09:23:03] or maybe it is moritzm upgrading [09:23:15] RECOVERY - HHVM jobrunner on mw1164 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [09:23:35] RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [09:24:54] yeah, I've been upgrading these, they were depooled though [09:25:12] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/337371 (https://phabricator.wikimedia.org/T136428) (owner: Marostegui) [09:25:47] moritzm: super, my coffee level is super low and I am writing silly things this morning, don't pay attention to me :) [09:26:48] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/337371 (https://phabricator.wikimedia.org/T136428) (owner: Marostegui) [09:26:56] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/337371 (https://phabricator.wikimedia.org/T136428) (owner: Marostegui) [09:28:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1078 - T136428 (duration: 00m 40s) [09:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:07] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [09:28:49] (CR) Hashar: [C: 1] "Puppet compiler: https://puppet-compiler.wmflabs.org/5424/" [puppet] - https://gerrit.wikimedia.org/r/337286 (owner: Hashar) [09:29:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/337372 [09:32:00] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/337372 (owner: Marostegui) [09:32:45] PROBLEM - HHVM jobrunner on mw1166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:33:24] (Merged) 
jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/337372 (owner: Marostegui) [09:33:36] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/337372 (owner: Marostegui) [09:33:45] RECOVERY - HHVM jobrunner on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [09:34:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078 - T136428 (duration: 00m 41s) [09:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:29] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [09:34:55] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg] [09:35:15] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:36:15] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.003 second response time [09:36:36] (PS2) Hashar: jenkins: merge user/group sub classes [puppet] - https://gerrit.wikimedia.org/r/337287 [09:36:55] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:38:14] (CR) Hashar: [C: 1] "Change is a no-op https://puppet-compiler.wmflabs.org/5427/" [puppet] - https://gerrit.wikimedia.org/r/337287 (owner: Hashar) [09:41:15] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:41:35] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:41:56] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 
58 seconds ago with 0 failures [09:41:56] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:42:15] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [09:42:35] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [09:42:55] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.001 second response time [09:48:51] (CR) Hashar: [C: 1] "Added inline comment to help review." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/337289 (owner: Hashar) [09:51:45] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:52:45] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [09:53:21] (PS2) Hashar: jenkins: support variable prefix setting [puppet] - https://gerrit.wikimedia.org/r/337307 [09:55:11] (PS3) Hashar: jenkins: support variable prefix setting [puppet] - https://gerrit.wikimedia.org/r/337307 [09:58:11] (CR) Hashar: [C: -1] "It is not a noop https://puppet-compiler.wmflabs.org/5430/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/337307 (owner: Hashar) [10:10:47] (CR) Hashar: [C: 1] "I have manually done the diff of the expanded template against the file modules/jenkins/files/etc_default_jenkins from parent change https" [puppet] - https://gerrit.wikimedia.org/r/337307 (owner: Hashar) [10:14:14] (PS2) Ema: varnish: remove ganglia vhtcpd python module [puppet] - https://gerrit.wikimedia.org/r/337001 [10:14:21] (CR) Ema: [V: 2 C: 2] varnish: remove ganglia vhtcpd python module [puppet] - https://gerrit.wikimedia.org/r/337001 (owner: Ema) [10:16:14] Operations, Labs, Tool-Labs, Traffic, HTTPS: 
Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#1363321 (scfc) >>! In T102367#2891340, @bd808 wrote: > I really think we could just flip the switch at the ingress proxy and then deal with the fallo... [10:21:48] Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3021239 (elukey) Difference between eqiad and codfw counts from Prometheus dashboard:... [10:23:55] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:28:55] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:45] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.036 second response time [10:31:43] (PS1) Hashar: jenkins: support umask via service default [puppet] - https://gerrit.wikimedia.org/r/337377 [10:32:51] Operations, Ops-Access-Requests, Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3021266 (Samtar) a:RobH [10:35:10] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1001.eqiad.wmnet [10:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:04] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3021272 (Gehel) wdqs1001 has catched up on updates and is now repooled. [10:37:09] (CR) ArielGlenn: "Why not just include the css file by reference in the download-index, dvd and legal html files as you do for the rest?" 
[puppet] - https://gerrit.wikimedia.org/r/337264 (https://phabricator.wikimedia.org/T155697) (owner: Ladsgroup) [10:40:25] PROBLEM - HHVM rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:40:29] (PS1) Gehel: elasticsearch - reimage to jessie and move data to /srv - preliminary work [puppet] - https://gerrit.wikimedia.org/r/337378 (https://phabricator.wikimedia.org/T151326) [10:41:15] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 76541 bytes in 0.109 second response time [10:41:42] (PS2) Gehel: elasticsearch - reimage to jessie and move data to /srv - preliminary work [puppet] - https://gerrit.wikimedia.org/r/337378 (https://phabricator.wikimedia.org/T151326) [10:42:41] (PS1) Muehlenhoff: Update SSH for Sam Tarling [puppet] - https://gerrit.wikimedia.org/r/337379 (https://phabricator.wikimedia.org/T157483) [10:47:05] !log remove big/spammy log files from thubmro100[12] - T157949 [10:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:09] T157949: Thumbor leaks pipes - https://phabricator.wikimedia.org/T157949 [10:47:21] (CR) Muehlenhoff: [C: 2] Update SSH for Sam Tarling [puppet] - https://gerrit.wikimedia.org/r/337379 (https://phabricator.wikimedia.org/T157483) (owner: Muehlenhoff) [10:47:34] (CR) Ladsgroup: "These are done in another folder. So I thought it's better to do it in another patch. I'm okay if you think we should do it in one patch." [puppet] - https://gerrit.wikimedia.org/r/337264 (https://phabricator.wikimedia.org/T155697) (owner: Ladsgroup) [10:48:20] Operations, Ops-Access-Requests, Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3021341 (MoritzMuehlenhoff) @Samtar : There was a copy&paste error in the patch to add your key, please try again. 
[10:50:34] Operations, Traffic, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#1824561 (Catrope) Even English Wikipedia's main page uses `Wikipedia:Today's featured article/{{#time:F j, Y}}`, it's a very common trick. Using time-dep... [10:52:05] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:52:41] (PS2) Gehel: wdqs1002 - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - https://gerrit.wikimedia.org/r/336226 (https://phabricator.wikimedia.org/T144536) [10:55:55] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:56:45] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.043 second response time [10:57:02] Operations, Traffic, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3021378 (Catrope) I went and viewed a few of these in incognito (around 10:50 UTC on Feb 13): trwiki: ``` NewPP limit report Parsed by mw1174 Cached tim... [10:59:47] (PS6) Gehel: WDQS - move metric collection to diamond [puppet] - https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [11:00:05] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:02:43] (PS4) Filippo Giunchedi: add prometheus1003/1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/336354 (https://phabricator.wikimedia.org/T152504) (owner: Dzahn) [11:10:09] (PS7) Gehel: WDQS - move metric collection to diamond [puppet] - https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [11:11:04] (CR) Filippo Giunchedi: [C: 2] add prometheus1003/1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/336354 (https://phabricator.wikimedia.org/T152504) (owner: Dzahn) [11:11:13] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1002.eqiad.wmnet [11:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:38] (CR) Gehel: [C: 2] wdqs1002 - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - https://gerrit.wikimedia.org/r/336226 (https://phabricator.wikimedia.org/T144536) (owner: Gehel) [11:11:49] (PS3) Gehel: wdqs1002 - move data to /srv/wdqs to follow the usual partitioning scheme [puppet] - https://gerrit.wikimedia.org/r/336226 (https://phabricator.wikimedia.org/T144536) [11:11:55] moritzm: merged your change too [11:14:45] PROBLEM - Check systemd state on planet1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:14:45] PROBLEM - Check systemd state on mw1266 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:14:45] PROBLEM - Check systemd state on db1088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:14:45] PROBLEM - Check systemd state on ganeti1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:14:45] PROBLEM - Check systemd state on mw1271 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:14:55] PROBLEM - Check systemd state on db1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:14:55] PROBLEM - Check systemd state on mw1201 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:05] PROBLEM - Check systemd state on logstash1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:05] PROBLEM - Check systemd state on kubernetes1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:13] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3021494 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1002.eqia... [11:15:35] PROBLEM - Check systemd state on kafka1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:35] PROBLEM - Check systemd state on db1066 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:35] PROBLEM - Check systemd state on mw1263 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:35] PROBLEM - Check systemd state on argon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:45] PROBLEM - Check systemd state on mw1254 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:45] PROBLEM - Check systemd state on db1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:45] PROBLEM - Check systemd state on bast1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:15:45] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:35] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:35] PROBLEM - Check systemd state on mw1241 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:35] PROBLEM - Check systemd state on mw1189 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:36] PROBLEM - Check systemd state on mw1204 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:45] PROBLEM - Check systemd state on mw1294 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:45] PROBLEM - Check systemd state on mw1173 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:45] PROBLEM - Check systemd state on ganeti1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:45] PROBLEM - Check systemd state on mw1300 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:46] PROBLEM - Check systemd state on mw1166 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:46] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:16:46] I think that might be me [11:16:52] checking [11:16:56] godog: something related to ferm... [11:17:05] PROBLEM - Check systemd state on auth1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:05] PROBLEM - Check systemd state on es1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:17:15] PROBLEM - Check systemd state on mw1235 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:18] yeah on kafka1001 is ferm [11:17:19] I'm checking my change as well, but I don't think it is related... [11:17:35] PROBLEM - Check systemd state on mw1205 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:35] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:36] PROBLEM - Check systemd state on mw1253 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:36] PROBLEM - Check systemd state on etherpad1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:36] PROBLEM - Check systemd state on mw1267 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:36] PROBLEM - Check systemd state on db1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:41] heh yeah it was my change, rolling back now [11:17:45] PROBLEM - Check systemd state on oresrdb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:45] PROBLEM - Check systemd state on mw1285 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:46] PROBLEM - Check systemd state on wtp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:17:46] PROBLEM - Check systemd state on mw1298 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:18:00] godog: shutting down ircecho [11:18:05] PROBLEM - Check systemd state on mw1213 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:18:45] !log stopped ircecho on einsteinium [11:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:06] elukey: thanks! [11:21:14] I'll actually fix the root cause [11:21:47] !log installing PHP security updates [11:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:30] (PS1) Filippo Giunchedi: wmnet: add ipv6 for prometheus100[34] [dns] - https://gerrit.wikimedia.org/r/337381 (https://phabricator.wikimedia.org/T152504) [11:26:10] (CR) Filippo Giunchedi: [C: 2] wmnet: add ipv6 for prometheus100[34] [dns] - https://gerrit.wikimedia.org/r/337381 (https://phabricator.wikimedia.org/T152504) (owner: Filippo Giunchedi) [11:27:03] PROBLEM - Check systemd state on wtp1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:03] PROBLEM - Check systemd state on db1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:03] PROBLEM - Check systemd state on mw1284 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:03] PROBLEM - Check systemd state on wtp1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:03] PROBLEM - Check systemd state on wtp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:14] (CR) Volans: "LGTM, just minor details inline. (non blocking)" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) (owner: Gehel) [11:27:33] PROBLEM - Check systemd state on chromium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:33] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:27:33] PROBLEM - Check systemd state on etcd1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:34] PROBLEM - Check systemd state on db1086 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:43] PROBLEM - salt-minion processes on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:27:43] PROBLEM - Disk space on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:27:43] PROBLEM - dhclient process on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:27:43] PROBLEM - Check systemd state on db1067 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:43] PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:43] PROBLEM - Check systemd state on mw1199 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:44] PROBLEM - Check systemd state on mw1276 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:27:53] PROBLEM - configured eth on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:03] PROBLEM - Check systemd state on mw1172 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:03] PROBLEM - Check systemd state on db1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:03] PROBLEM - Check systemd state on mc1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:03] PROBLEM - Check systemd state on mw1170 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:28:03] PROBLEM - Check systemd state on druid1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:03] PROBLEM - Check systemd state on mw1177 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:04] PROBLEM - Check systemd state on mw1251 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:13] PROBLEM - Check systemd state on pc1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:13] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:14] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:14] PROBLEM - DPKG on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:22] I'll shut ircecho again [11:28:31] godog: ^^^ is this you? [11:28:33] PROBLEM - Check systemd state on restbase1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:33] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:43] PROBLEM - Check systemd state on conf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:43] PROBLEM - Check systemd state on kafka1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:44] PROBLEM - Check systemd state on mw1195 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:44] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:28:49] ah sorry, just read backlog [11:28:53] PROBLEM - Check systemd state on wtp1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:29:58] volans: yeah :( [11:30:44] sadly the next puppet run will succeed without actually reloading ferm [11:31:02] godog: sorry puppet re-enabled ircecho, I always forget : [11:31:04] :( [11:35:22] (03PS8) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [11:38:34] (03PS1) 10Hashar: jenkins: logrotate all log files [puppet] - 10https://gerrit.wikimedia.org/r/337383 [11:39:41] (03PS2) 10Hashar: jenkins: logrotate all log files [puppet] - 10https://gerrit.wikimedia.org/r/337383 [11:41:03] sigh ok even after adding the AAAAs it isn't going to fix by itself, I'll rollback the change so that ferm has a chance to reload for real [11:42:32] (03CR) 10ArielGlenn: "Yes, let's do them all at once, for a changeset like this that vastly simplifies the files. 
If it were a commit with a lot of small chang" [puppet] - 10https://gerrit.wikimedia.org/r/337264 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [11:42:51] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) (owner: 10Gehel) [11:45:18] (03PS1) 10Filippo Giunchedi: hieradata: temporarily remove prometheus100[34] from prometheus_hosts [puppet] - 10https://gerrit.wikimedia.org/r/337384 (https://phabricator.wikimedia.org/T152504) [11:47:29] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: temporarily remove prometheus100[34] from prometheus_hosts [puppet] - 10https://gerrit.wikimedia.org/r/337384 (https://phabricator.wikimedia.org/T152504) (owner: 10Filippo Giunchedi) [11:48:16] (03PS2) 10Ladsgroup: dumps: Centeralize CSS in one file, make it wider and apply to more files [puppet] - 10https://gerrit.wikimedia.org/r/337264 (https://phabricator.wikimedia.org/T155697) [11:49:58] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3021558 (10fgiunchedi) >>! In T157022#3013111, @fgiunchedi wrote: >>>! In T157022#3002379, @MoritzMuehlenhoff wrote: >>>>! In T157022#2997068, @fgiunchedi wrote: >>> Read traffic ha... [11:51:49] (03PS1) 10Hashar: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385 [11:53:06] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 05MW-1.28-release-notes: File does not thumbnail, doesn't have extracted metadata, has reported zero width/height (due to garbage bytes between JPEG sections) - https://phabricator.wikimedia.org/T148606#3021559 (10MoritzMuehlenhoff) @mat... 
[11:55:06] 06Operations, 15User-Urbanecm: Server refuses to save an edit - https://phabricator.wikimedia.org/T157950#3021562 (10Peachey88) [11:59:29] !log removing unneeded PHP packages from mw1261-mw1265 (these were installed before we changed puppet trim most PHP packages in favour of HHVM) [11:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:06] (03CR) 10Addshore: [C: 031] WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) (owner: 10Gehel) [12:07:20] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3021569 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1002.eqiad.wmnet'] ``` and were **ALL** successful. [12:32:52] !log updating elastic search ACLs on cr1/cr2 for the analytics-ip4 filter [12:32:56] gehel: ---^ [12:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:07] 06Operations, 10Analytics, 10Analytics-Cluster: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#3021593 (10MoritzMuehlenhoff) I also have a (slight) preference towards cdh. [12:33:10] elukey: kool! Thanks! 
[12:33:49] (03PS1) 10Filippo Giunchedi: coal: run on jessie [puppet] - 10https://gerrit.wikimedia.org/r/337386 (https://phabricator.wikimedia.org/T157022) [12:34:44] (03CR) 10jerkins-bot: [V: 04-1] coal: run on jessie [puppet] - 10https://gerrit.wikimedia.org/r/337386 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [12:36:37] RECOVERY - Check systemd state on einsteinium is OK: OK - running: The system is fully operational [12:38:30] (03PS1) 10Muehlenhoff: Add debdeploy salt grains for new dbmonitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/337387 [12:39:17] (03PS2) 10Filippo Giunchedi: coal: run on jessie [puppet] - 10https://gerrit.wikimedia.org/r/337386 (https://phabricator.wikimedia.org/T157022) [12:51:56] (03PS2) 10Muehlenhoff: Add debdeploy salt grains for new dbmonitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/337387 [12:56:17] PROBLEM - DPKG on phab2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:57:17] RECOVERY - DPKG on phab2001 is OK: All packages OK [12:57:48] ^that was me during the php update [12:58:37] PROBLEM - Disk space on phab2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied [12:59:37] RECOVERY - Disk space on phab2001 is OK: DISK OK [13:11:00] 06Operations, 10hardware-requests: spare ex4200s - check on quantity for potential shipment to OIT - https://phabricator.wikimedia.org/T157839#3021666 (10Cmjohnson) Currently there are 6 spare 4200's in eqiad. I believe @papaul has a few in codfw [13:11:47] PROBLEM - Disk space on ununpentium is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied [13:13:43] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3021684 (10elukey) Fixed elastic IPs (not added annotations to analytics-in4). Next ones: 1) Remove udplog ? ``` term udplog { from { destination-address { 233... 
[13:14:16] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3021686 (10Aklapper) Does this task supersede T127676? [13:19:39] (03CR) 10Muehlenhoff: [C: 032] Add debdeploy salt grains for new dbmonitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/337387 (owner: 10Muehlenhoff) [13:23:01] (03Abandoned) 10Ema: Allow ganglia to read VSM files [puppet] - 10https://gerrit.wikimedia.org/r/289696 (owner: 10Ema) [13:31:21] (03PS1) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [13:32:17] RECOVERY - Disk space on ununpentium is OK: DISK OK [13:32:30] (03CR) 10jerkins-bot: [V: 04-1] jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 (owner: 10Hashar) [13:33:56] !log rolling restart of zookeeper in codfw to pick up Java security updates [13:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:15] (03PS2) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [13:37:19] !log lvs1004: upgrade to jessie 8.7, pybal 1.13.4, reboot into kernel 4.4.2-3+wmf8 T155401 [13:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:23] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [13:47:47] !log lvs100[56]: upgrade to jessie 8.7, pybal 1.13.4, reboot into kernel 4.4.2-3+wmf8 T155401 [13:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:54] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [13:48:38] zeljkof: I can swat today (as most of them are mine) :) [13:50:12] addshore: excelente! 
:) [13:50:16] please do [13:52:27] (03PS9) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [13:52:38] 06Operations, 10ops-eqiad, 06DC-Ops: Racktables equipment that should probably be renamed ? - https://phabricator.wikimedia.org/T150744#3021767 (10Cmjohnson) 05Open>03Resolved The system that Alex mentioned has been resolved and I am not aware of any other issues. [13:54:36] (03PS3) 10Addshore: Enable TwoColConflict on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332910 (https://phabricator.wikimedia.org/T155721) [13:55:10] (03PS2) 10Addshore: Updating portal stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337366 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:57:00] !log Shutdown db2060 for maintenance - T156161 [13:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:06] T156161: db2060 not accessible - https://phabricator.wikimedia.org/T156161 [13:57:59] !log rolling restart of zookeeper in eqiad to pick up Java security updates [13:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:27] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3021773 (10Marostegui) @Cmjohnson as per your update on T154587, let me know when you replace db1053's disk. Thanks! [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170213T1400). [14:00:04] addshore and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[14:00:12] *waves toward jan_drewniak* [14:00:26] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3021775 (10Marostegui) @Papaul db2060 is now off. Once you are done, please power it on and I will take care from there. Thanks! [14:01:08] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3021776 (10Cmjohnson) p:05Normal>03Triage Disks arrived....moving this up to do today. [14:02:08] Okay, starting with mine & then I'll move on from there. [14:02:27] 06Operations, 10ops-eqiad: Decommission old asw-c2-eqiad - https://phabricator.wikimedia.org/T156398#3021778 (10Cmjohnson) @robh are either of these under our service contract with juniper? The spare that failed stuck in loading (the spare) BP0208110780 The original that failed asw-c2-eqiad BP0211517641 [14:02:39] 06Operations, 10ops-eqiad: Decommission old asw-c2-eqiad - https://phabricator.wikimedia.org/T156398#3021780 (10Cmjohnson) a:05Cmjohnson>03RobH [14:04:34] addshore: I can help for the portal change, did it a few time already [14:04:39] there is some specific shell script to run under /portals/ [14:05:07] Okay, yeh I looked at it earlier, it looks like it does the scap sync-dir? do i guess it is just a case of get the change on the box and then run the script? [14:05:27] yup [14:05:48] ack, I'll ping you once I have done the other changes then! :) [14:10:22] Hi. [14:11:01] *twiddles thumbs waiting for CI to finish running extension gate submit jobs* [14:11:03] addshore: may I also add a config change? https://gerrit.wikimedia.org/r/#/c/337345/ [14:11:22] Dereckson: sure! 
[14:11:33] thanks [14:15:44] (03PS2) 10Milimetric: Enable Dashiki extension on meta.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336444 (https://phabricator.wikimedia.org/T156971) [14:18:27] !log addshore@tin Synchronized php-1.29.0-wmf.11/extensions/RevisionSlider/modules/ext.RevisionSlider.css: T157800 [[gerrit:337030|Dont set min-height and min-width for oo-ui buttons]] 1/2 (duration: 01m 07s) [14:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:31] T157800: RevisionSlider UI elements dimensions look odd - https://phabricator.wikimedia.org/T157800 [14:19:44] !log addshore@tin Synchronized php-1.29.0-wmf.11/extensions/RevisionSlider/modules/ext.RevisionSlider.lazy.css: T157800 [[gerrit:337030|Dont set min-height and min-width for oo-ui buttons]] 2/2 (duration: 00m 55s) [14:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:08] (03PS1) 10Ladsgroup: gerrit: Make blue buttons look like OOUI [puppet] - 10https://gerrit.wikimedia.org/r/337397 [14:23:21] !log addshore@tin Synchronized php-1.29.0-wmf.11/extensions/TwoColConflict/includes/TwoColConflictHooks.php: [[gerrit:337186|Change beta feature info and talk links (duration: 00m 40s) [14:23:23] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 10Scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3021831 (10Addshore) 05Resolved>03Open I just saw this again today in EU swat. 
[14:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:39] (03CR) 10Addshore: [C: 032] Enable TwoColConflict on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332910 (https://phabricator.wikimedia.org/T155721) (owner: 10Addshore) [14:25:42] (03Merged) 10jenkins-bot: Enable TwoColConflict on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332910 (https://phabricator.wikimedia.org/T155721) (owner: 10Addshore) [14:26:54] (03CR) 10jenkins-bot: Enable TwoColConflict on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332910 (https://phabricator.wikimedia.org/T155721) (owner: 10Addshore) [14:27:29] Hi addshore, I swat still on or finished? Sorry my irc client was turned off :/ [14:27:50] still going :) [14:28:01] jan_drewniak: Your up next (after this one finishes syncing) << hashar [14:28:10] o/ [14:28:17] (03PS3) 10Addshore: Updating portal stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337366 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:28:26] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T155721 [[gerrit:332910|Enable TwoColConflict on metawiki]] (duration: 00m 40s) [14:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:30] T155721: Deploy TwoColConflict extension to meta, dewiki and one RTL-wiki - https://phabricator.wikimedia.org/T155721 [14:28:52] portal time! [14:28:58] (03CR) 10Addshore: [C: 032] Updating portal stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337366 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:29:16] 06Operations: systemd-timedated starting up every minute - https://phabricator.wikimedia.org/T157797#3021861 (10MoritzMuehlenhoff) That is in fact the Icinga running timedatectl, so this is expected. I'll reduce the execution time of the check, though. The check for ntpd runs every 30 mins, which is fine here as... 
[14:29:39] 06Operations: Run Icinga check for systemd-timedated less often - https://phabricator.wikimedia.org/T157797#3021862 (10MoritzMuehlenhoff) [14:29:43] 06Operations, 10Analytics, 10Analytics-Cluster: Zookeeper heap usage patterns - https://phabricator.wikimedia.org/T157968#3021877 (10elukey) [14:29:54] 06Operations, 10Analytics, 10Analytics-Cluster: Zookeeper heap usage patterns - https://phabricator.wikimedia.org/T157968#3021890 (10elukey) p:05Triage>03Normal [14:30:09] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: Return code of 255 is out of bounds [14:30:29] PROBLEM - DPKG on ms-fe1005 is CRITICAL: Return code of 255 is out of bounds [14:30:39] PROBLEM - Disk space on ms-fe1005 is CRITICAL: Return code of 255 is out of bounds [14:30:49] PROBLEM - MD RAID on ms-fe1005 is CRITICAL: Return code of 255 is out of bounds [14:31:20] jan_drewniak: hashar, I'm guessing the portals are testable on mwdebug1002 in some way? [14:31:30] PROBLEM - configured eth on ms-fe1005 is CRITICAL: Return code of 255 is out of bounds [14:31:32] (03Merged) 10jenkins-bot: Updating portal stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337366 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:31:36] ms-fe1005 is me, silencing [14:31:38] (03PS3) 10Filippo Giunchedi: coal: run on jessie [puppet] - 10https://gerrit.wikimedia.org/r/337386 (https://phabricator.wikimedia.org/T157022) [14:31:41] (03CR) 10jenkins-bot: Updating portal stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337366 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:31:49] PROBLEM - dhclient process on ms-fe1005 is CRITICAL: Return code of 255 is out of bounds [14:31:57] addshore: yup [14:31:59] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: Return code of 255 is out of bounds [14:32:49] jan_drewniak: I believe it should now be on mwdebug1002 then! 
:) [14:34:11] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 10Scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3021897 (10Reedy) [14:34:39] (03PS4) 10Filippo Giunchedi: coal: run on jessie [puppet] - 10https://gerrit.wikimedia.org/r/337386 (https://phabricator.wikimedia.org/T157022) [14:35:22] addshore: yup, looks good [14:35:35] ack, time to run the sync script! [14:36:21] !log addshore@tin Synchronized portals/prod/wikipedia.org/assets: Updating portal stats [[gerrit:337366|Gerrit]] (duration: 00m 40s) [14:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:29] RECOVERY - DPKG on ms-fe1005 is OK: All packages OK [14:36:29] RECOVERY - configured eth on ms-fe1005 is OK: OK - interfaces up [14:36:33] (03CR) 10Filippo Giunchedi: [C: 032] coal: run on jessie [puppet] - 10https://gerrit.wikimedia.org/r/337386 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [14:36:39] RECOVERY - Disk space on ms-fe1005 is OK: DISK OK [14:36:49] RECOVERY - dhclient process on ms-fe1005 is OK: PROCS OK: 0 processes with command name dhclient [14:36:49] RECOVERY - MD RAID on ms-fe1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:37:02] !log addshore@tin Synchronized portals: Updating portal stats [[gerrit:337366|Gerrit]] (duration: 00m 40s) [14:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:09] RECOVERY - Check systemd state on ms-fe1005 is OK: OK - running: The system is fully operational [14:37:24] jan_drewniak: should be done! 
[14:37:35] (03PS2) 10Addshore: Update Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337345 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:37:40] (03PS2) 10Hashar: jenkins: sync default file with upstream 1.651.3 [puppet] - 10https://gerrit.wikimedia.org/r/337289 [14:37:42] (03PS4) 10Hashar: jenkins: support variable prefix setting [puppet] - 10https://gerrit.wikimedia.org/r/337307 [14:37:44] (03PS2) 10Hashar: jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 [14:37:46] (03PS2) 10Hashar: jenkins: allow access log to be flipped [puppet] - 10https://gerrit.wikimedia.org/r/337385 [14:37:48] (03PS3) 10Hashar: jenkins: allow changing the web service TCP port [puppet] - 10https://gerrit.wikimedia.org/r/337388 [14:37:49] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3021901 (10Nithum) Just signed the NDA. Please let me know if there's anything else needed on my end. [14:37:50] (03PS1) 10Hashar: (DO NOT SUBMIT) octopus merge of Jenkins changes [puppet] - 10https://gerrit.wikimedia.org/r/337399 [14:37:53] Dereckson: doing yours now! [14:37:57] (03CR) 10Addshore: [C: 032] Update Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337345 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:37:59] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:40:00] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#1824561 (10Anomie) Some comments: * This seems like it may be a duplicate of {T51803}. * `{{#time:}}`'s TTL doesn't seem to be being propagated, so that 3... 
[14:40:08] (03Merged) 10jenkins-bot: Update Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337345 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:40:20] (03CR) 10jenkins-bot: Update Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337345 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [14:40:40] Dereckson: its on mwdebug1002 for you, please check! [14:41:27] (03PS1) 10Ema: MonitoringProtocol: do not crash with ValueError on unicode strings [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/337400 (https://phabricator.wikimedia.org/T147425) [14:41:48] thanks [14:46:14] Dereckson: how does it look? :) [14:48:10] I need to send a follow up change with the old namespace in alias, or we lost some pages with this config. [14:48:39] okay! I'll hold off on syncing [14:49:28] (03PS1) 10Muehlenhoff: Only run the timesynd_ntp_status Icinga check every 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/337401 [14:54:14] (03CR) 10Gehel: [C: 032] WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) (owner: 10Gehel) [14:54:20] 06Operations, 10Analytics, 10Analytics-Cluster: Zookeeper heap usage patterns - https://phabricator.wikimedia.org/T157968#3021956 (10elukey) [14:54:23] (03PS10) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/335646 (https://phabricator.wikimedia.org/T146468) [14:54:39] (03CR) 10DCausse: [C: 031] "lgtm," [puppet] - 10https://gerrit.wikimedia.org/r/337378 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [14:54:39] gehel woo! [14:54:53] addshore: yeah, it took some time ... [14:55:23] Thanks for doing it, I still havn't found the time to look into the diamond stuff much yet [14:56:02] addshore: it is fairly easy, and relatively well documented. 
Ping me if you want details! [14:57:43] Dereckson: any progress? Coming to the end of the window ;) [14:58:02] (03CR) 10Gehel: "@dcausse: the full deployment process is slightly more complex, but yes, you got most of it. See https://phabricator.wikimedia.org/T151326" [puppet] - 10https://gerrit.wikimedia.org/r/337378 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [14:58:57] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3021957 (10fgiunchedi) [14:59:40] addshore: yes I'm commiting [15:01:27] (03PS1) 10Dereckson: Support legacy Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337402 (https://phabricator.wikimedia.org/T101634) [15:01:29] addshore: here you are ^ [15:02:10] (03CR) 10Addshore: [C: 032] Support legacy Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337402 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [15:02:24] (03CR) 10Faidon Liambotis: [C: 031] Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:03:47] (03Merged) 10jenkins-bot: Support legacy Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337402 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [15:03:57] (03CR) 10jenkins-bot: Support legacy Wikiquote talk namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337402 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [15:05:09] 'Dereckson on mwdebug1002! [15:05:16] Thanks, testing. 
[15:05:31] looks good for me [15:05:48] cool, syncing [15:06:35] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T101634 [[gerrit:337345|Update Wikiquote talk namespace in Sanskrit Wikisource]] and [[gerrit:337402|Support legacy Wikiquote talk namespace in Sanskrit Wikisource]] (duration: 00m 40s) [15:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:39] T101634: Correction of namespace names in Sanskrit - https://phabricator.wikimedia.org/T101634 [15:06:57] !log EU SWAT done [15:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:54] 06Operations, 10Analytics, 10Analytics-Cluster: Zookeeper heap usage patterns - https://phabricator.wikimedia.org/T157968#3021963 (10elukey) [15:17:48] Thanks addshore for the deploy. [15:17:55] np :) [15:18:04] !log switched krypton to exim4-daemon-light (the -heavy variant was installed from an earlier role it carried) [15:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:41] so gehel, can we merge https://gerrit.wikimedia.org/r/#/c/336208/ now? [15:19:45] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 10Scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#3021971 (10mmodell) these are generally caused by some files owned by l10nupdate... [15:20:19] addshore: not quite yet... I need to switch the icinga checks first, and I need some amount of history first... Or better, merge the history of this metrics. [15:20:24] gehel: although I guess https://github.com/wikimedia/operations-puppet/blob/production/modules/wdqs/manifests/monitor/services.pp#L42 should be updated too! [15:20:30] ack :) [15:20:34] I'll leave it all to you! :) [15:20:55] Will do! 
[15:22:42] 06Operations: Puppet fails only once when restarting ferm is not successful - https://phabricator.wikimedia.org/T157972#3021981 (10fgiunchedi) [15:22:59] moritzm: ^ [15:23:11] (03PS2) 10Giuseppe Lavagetto: stdlib: upgrade to 4.15.0 [puppet] - 10https://gerrit.wikimedia.org/r/336230 [15:23:36] <_joe_> I'm merging ^^ [15:24:53] <_joe_> today, that is [15:29:04] (03CR) 10DCausse: [C: 04-1] "I'd like some advices on the "more logic" vs "verbose service definition", I'm undecided (see comments in ProductionServices.php)" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [15:30:33] (03CR) 10jerkins-bot: [V: 04-1] stdlib: upgrade to 4.15.0 [puppet] - 10https://gerrit.wikimedia.org/r/336230 (owner: 10Giuseppe Lavagetto) [15:32:33] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3021999 (10fgiunchedi) >>! In T123728#2993785, @bd808 wrote: >>>! In T123728#2992752, @fgiunchedi wrote: >> For redundancy purposes it would be nice if mediawiki could send udp2log traffic to udp2log... [15:36:19] !log lvs1001: upgrade to jessie 8.7, pybal 1.13.4, reboot into kernel 4.4.2-3+wmf8 T155401 [15:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:27] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [15:37:13] addshore: of course, whisper-merge is broken... [15:37:23] bwhahaha [15:38:00] gehel: you could always just copy the ones from my script over the top of the ones generated by diamond already, that would essentially work as a merge [15:38:33] addshore: yeah, might be the best course of action... [15:38:33] (but losing the last 30 mins or so of new data from diamond) but then that would still be filled by the set from my script too! 
[15:39:29] PROBLEM - PyBal backends health check on lvs1001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [15:39:29] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:39:55] that's me ^ [15:40:12] bad ema :-P [15:41:35] (03PS1) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [15:41:49] PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:41:58] (03CR) 10Hashar: [C: 04-1] "WIP, absolutely not tested." [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [15:45:49] RECOVERY - DPKG on lvs1001 is OK: All packages OK [15:47:13] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#3022039 (10Nuria) Let's merge! thank you. [15:47:45] (03CR) 10jerkins-bot: [V: 04-1] jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [15:50:20] 06Operations, 10ops-eqiad: decommission ms1003 - https://phabricator.wikimedia.org/T157975#3022054 (10ArielGlenn) [15:52:47] !log copy old blazegraph metrics to new path (wikidata.query.(triple|lag).* -> servers....) 
[15:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:39] (03PS2) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/336220 (https://phabricator.wikimedia.org/T146468) [15:57:32] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3022072 (10RobH) 05Open>03Resolved a:05RobH>03None [15:58:33] !log lvs1002: upgrade to jessie 8.7, pybal 1.13.4, reboot into kernel 4.4.2-3+wmf8 T155401 [15:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:38] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [15:59:42] (03PS2) 10Faidon Liambotis: Remove jzerebecki from Icinga contact groups [puppet] - 10https://gerrit.wikimedia.org/r/337209 [16:00:13] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Remove jzerebecki from Icinga contact groups [puppet] - 10https://gerrit.wikimedia.org/r/337209 (owner: 10Faidon Liambotis) [16:01:19] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3022080 (10RobH) We actually have to have legal confirm the NDA signatures. They should be able to update the task confirming shortly. [16:08:53] !log lvs1003: upgrade to jessie 8.7, pybal 1.13.4, reboot into kernel 4.4.2-3+wmf8 T155401 [16:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:58] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [16:09:06] (03PS1) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [16:10:54] (03CR) 10Hashar: "The rsyslog templates matches the programname with starts with which is I believe a bit too permissive. 
The template got added for servic" [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [16:16:46] (03PS1) 10Elukey: Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) [16:17:57] (03PS2) 10Elukey: Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) [16:18:19] (03CR) 10jerkins-bot: [V: 04-1] Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [16:18:25] (03PS3) 10Elukey: Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) [16:18:49] argh too many notifications sorry :) [16:19:44] (03PS4) 10Dzahn: CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) [16:20:30] (03PS4) 10Elukey: Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) [16:20:35] (03CR) 10jerkins-bot: [V: 04-1] Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [16:20:54] (03CR) 10Gehel: Add default JAVA_OPTS to the zookeeper server class (031 comment) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [16:21:04] hashar: ^ how about scandium? still +1 and could be shutdown ? [16:21:39] gehel: thanks for the comment, I am doing too many things at the same time and sending silly things [16:22:02] :P missing coma are the hardest to spot yourself... 
[16:31:56] (03PS3) 10Gehel: WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/336220 (https://phabricator.wikimedia.org/T146468) [16:33:09] (03CR) 10Gehel: [C: 032] WDQS - move metric collection to diamond [puppet] - 10https://gerrit.wikimedia.org/r/336220 (https://phabricator.wikimedia.org/T146468) (owner: 10Gehel) [16:33:33] mutante: guten tag [16:33:38] mutante: Ja. Scandium can be removed [16:38:34] hashar: ouf! Je kiffe. (i tried) :) [16:38:39] hashar: on it [16:39:20] (03CR) 10Dzahn: [C: 032] CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) (owner: 10Dzahn) [16:39:33] (03PS5) 10Dzahn: CI: decom scandium [puppet] - 10https://gerrit.wikimedia.org/r/337041 (https://phabricator.wikimedia.org/T150936) [16:41:10] (03CR) 10Dzahn: "aah! yep, on the other hand it also makes it easy to first run puppet and get the mapped IP on the interface and then knowing it's the rig" [puppet] - 10https://gerrit.wikimedia.org/r/337384 (https://phabricator.wikimedia.org/T152504) (owner: 10Filippo Giunchedi) [16:44:28] (03CR) 10Mobrovac: "One nit in-lined, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [16:45:31] 06Operations, 10hardware-requests: spare ex4200s - check on quantity for potential shipment to OIT - https://phabricator.wikimedia.org/T157839#3022223 (10Papaul) i have 2 onsite [16:46:21] addshore: the changes to wdqs monitoring are deployed and working. Do you want me to merge your 2 patches on 336220 ? [16:46:31] sorry, on analytics/wmde/scripts [16:46:51] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3022224 (10RStallman-legalteam) Yes, the NDA is fully executed and on file. Thanks everyone! 
[16:50:40] (03PS1) 10Dzahn: prometheus: add AAAA for 1003/1004, v6 PTR for all [dns] - 10https://gerrit.wikimedia.org/r/337422 (https://phabricator.wikimedia.org/T154504) [16:51:12] (03PS1) 10DCausse: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337423 (https://phabricator.wikimedia.org/T155515) [16:51:47] (03PS2) 10DCausse: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337423 (https://phabricator.wikimedia.org/T155515) [16:52:08] (03PS2) 10Dzahn: prometheus: add v6 reverse records [dns] - 10https://gerrit.wikimedia.org/r/337422 (https://phabricator.wikimedia.org/T154504) [16:55:31] (03PS7) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170213T1700). Please do the needful. [17:00:04] hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:02:09] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:57] (03CR) 10EBernhardson: Enable Translation memories multi-DC support (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [17:06:35] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3022286 (10RobH) a:03RobH Thanks! I'm claiming this to push my patch and merge at the end of the 3 day wait - pending no objections. Just to summarize: [x] - new login de... 
[17:09:40] o/ [17:09:57] 06Operations, 10hardware-requests: spare ex4200s - check on quantity for potential shipment to OIT - https://phabricator.wikimedia.org/T157839#3022294 (10RobH) We need to keep some on-site in each location. I'm not sure it is worth touching @papaul's two spares. It may be worthwhile to ship half the spares f... [17:10:07] 06Operations, 10hardware-requests: spare ex4200s - check on quantity for potential shipment to OIT - https://phabricator.wikimedia.org/T157839#3022296 (10RobH) a:05Cmjohnson>03None [17:10:11] godog: moritzm: _joe_: got a bunch of not so interesting patches for puppet swat :) [17:10:18] 06Operations, 10hardware-requests: spare ex4200s - check on quantity for potential shipment to OIT - https://phabricator.wikimedia.org/T157839#3018327 (10RobH) a:03RobH [17:10:38] godog: moritzm: _joe_: or is that the time slot for your team meeting? [17:12:24] hashar: yes, it is the meeting [17:12:24] hashar: it is, puppetswat is tue/thur usually, we're indeed in the meeting [17:12:52] so ignore my patches :-} the wikitech Deployments was mysteriously adding a puppet swat slot for today :-} [17:24:54] (03CR) 10Gehel: "@robh: your review is most welcomed on the partman recipe side in particular..." [puppet] - 10https://gerrit.wikimedia.org/r/337378 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:25:09] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:25:58] ^ this keeps happening periodically... anyone care to look at the puppetmaster log and see what's failing? [17:27:22] (03CR) 10Reedy: [C: 031] Enable Echo on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105) (owner: 10Kaldari) [17:28:09] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:29:50] twentyafterfour: ^ works on next run. 
the error is "Attempt to assign to a reserved variable name: 'trusted' on node iridium.eqiad.wmnet [17:30:04] weird [17:30:12] yea [17:30:23] * twentyafterfour greps for 'trusted' [17:30:24] we saw that in compiler runs before [17:30:35] but not related to iridium [17:30:39] afair. .. in meeting though [17:31:46] thanks! [17:32:09] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:32:51] twentyafterfour: https://phabricator.wikimedia.org/T153246 [17:33:05] same error, different servers [17:33:20] so probably not unique to phabricator role [17:33:28] yeah indeed [17:33:44] but more a general problem when the master or client are under load or something [17:34:22] the "trusted" thing is probably more or less random effect of "could not get catalog from master" [17:34:48] the linked upstream bug implicates fact cache [17:34:56] or rather puppetdb and fact cache masks the error [17:35:04] ah [17:35:11] i didnt even notice that yet [17:36:10] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#2873968 (10Dzahn) this is now happening on iridium ``` @puppetmaster1001:/var/log# grep trusted /var/log/syslog Feb 13 17:22:13 puppetmaster1001 puppet-master[17... [17:47:03] _joe_: hey, around? [17:47:11] <_joe_> Amir1: in a meeting [17:47:35] okay, I put it here, read it when you can. Thanks [17:47:35] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3022399 (10Papaul) a:05Papaul>03Marostegui Main board replacement complete. 
[17:48:09] 06Operations, 10ops-codfw, 06DC-Ops, 10hardware-requests, 13Patch-For-Review: decom install2001 - https://phabricator.wikimedia.org/T157840#3022406 (10Papaul) Disk wipe in progress [17:49:45] We are talking with reading team to bring back that api.php functionality with some modifications for stricter limit on batching but that users is still sending lots of requests (320,000 per day) and the UA doesn't meet the UA policy so we can't contact them and ask to slow down. So we were thinking if it's okay to issue a varnish ban before we can bring [17:49:46] this back up? [17:49:55] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#2873968 (10Paladox) Hi, this is related https://ask.puppet.com/question/17184/attempt-to-assign-to-a-reserved-variable-name-trusted/ [17:52:23] Reedy: Thanks for the +1~ [17:54:01] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#3022440 (10Paladox) It looks like this has been filled as a bug report here https://tickets.puppetlabs.com/browse/PUP-5441 [17:56:25] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#3022463 (10Paladox) Fixed in https://github.com/puppetlabs/puppet/pull/4390 which is puppet 4.3.0+ [17:59:06] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3022468 (10Marostegui) Thanks! I will take it from here! [18:00:05] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170213T1800). Please do the needful. [18:00:24] nothing to deploy on wdqs today... [18:03:28] I have to be afk now. 
[18:05:28] (03PS4) 10Nschaaf: Drop wdqs_extract partitions older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) [18:05:37] !log scandium - ex-zuul merger - removing from puppet, revoking puppet cert, salt key.. [18:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:35] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#3022487 (10Paladox) i also found this report https://tickets.puppetlabs.com/browse/PDB-949 [18:08:32] (03CR) 10Chad: [C: 031] zuul: use a proper require for the merger class [puppet] - 10https://gerrit.wikimedia.org/r/337008 (owner: 10Hashar) [18:09:29] (03PS5) 10Nschaaf: Drop wdqs_extract partitions older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) [18:09:45] 06Operations, 07Puppet: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246#3022488 (10Paladox) We need to do "Removing the two storeconfigs-lines from /etc/puppet/puppet.conf and a restart of puppetmaster makes everything work again." Wh... [18:15:09] (03PS1) 10Dzahn: remove scandium, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/337434 (https://phabricator.wikimedia.org/T150936) [18:15:52] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [18:16:52] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:17:33] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:17:59] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests, 13Patch-For-Review: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3022502 (10Dzahn) - 10:07 < mutante> !log scandium - ex-zuul merger - removing from puppet, revoking puppe... [18:18:05] !log scandium - shutdown -h now (T150936) [18:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:10] T150936: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936 [18:22:18] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Monitor Certificate Transparency (CT) logs - https://phabricator.wikimedia.org/T155807#3022512 (10faidon) 05Open>03Resolved a:03faidon This is done now, we got our first certspotter email on Friday :) There is more to do in this area (better alert... [18:24:12] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3022519 (10greg) p:05Unbreak!>03High This is not an emergency, lowering to High.
[18:32:30] (03CR) 10Chad: [C: 031] "Minor inline question, but otherwise lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337385 (owner: 10Hashar) [18:32:36] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests, 13Patch-For-Review: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3022562 (10RobH) [18:32:52] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [18:33:02] (03CR) 10Chad: [C: 031] jenkins: support umask via service default [puppet] - 10https://gerrit.wikimedia.org/r/337377 (owner: 10Hashar) [18:33:03] !log labstore1005 service maintain-dbusers restart [18:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:42] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [18:37:59] (03CR) 10Dzahn: [C: 032] "switch port is already disabled" [dns] - 10https://gerrit.wikimedia.org/r/337434 (https://phabricator.wikimedia.org/T150936) (owner: 10Dzahn) [18:38:43] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests, 13Patch-For-Review: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3022594 (10Dzahn) [18:41:03] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3022600 (10Dzahn) [18:46:32] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:50:55] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#3022655 (10Dzahn) >>! In T130883#3015990, @MoritzMuehlenhoff wrote: > Hmm, cp3014,cp3020,cp3022 are still listed in https://servermon.wikimedia.org/hosts/, though. 
No idea why, let's wait... [18:55:30] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3022661 (10RobH) [18:55:52] 06Operations, 10ops-codfw, 06DC-Ops, 10hardware-requests, 13Patch-For-Review: decom install2001 - https://phabricator.wikimedia.org/T157840#3022675 (10Dzahn) removed from servermon along with install1001 [18:56:52] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#3022679 (10RobH) [19:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170213T1900). [19:00:05] dcausse and kaldari: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:19] here [19:00:21] o/ [19:00:22] !log upgrading mw1236 with the security updates it missed while it was powered off [19:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:51] I can SWAT [19:02:55] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3022692 (10RobH) a:05RobH>03Papaul [19:03:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337423 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse) [19:03:53] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3022661 (10RobH) The disks should have been wiped by our Tampa onsite before they were shipped to codfw. However, I don't want to trust that, since it wasn't documented and we cannot be c... 
[19:04:58] (03Merged) 10jenkins-bot: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337423 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse) [19:06:02] dcausse: your change is live on mwdebug1002, check please [19:06:10] thcipriani: looking [19:06:41] (03CR) 10jenkins-bot: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337423 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse) [19:08:47] thcipriani: I can't test without mwscript on mwdebug1002... I'd suggest moving forward and I'll double check on terbium once deployed [19:09:16] dcausse: ok, will deploy [19:11:01] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:337423|properly set wgCirrusSearchUseIcuFolding]] T155515 (duration: 00m 41s) [19:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:05] T155515: Reindex el, en, fr and he wikis to enable ICU folding - https://phabricator.wikimedia.org/T155515 [19:11:09] ^ dcausse live everywhere [19:11:15] thcipriani: checking [19:11:34] thcipriani: sounds good, thanks! [19:12:25] (03PS3) 10Thcipriani: Enable Echo on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105) (owner: 10Kaldari) [19:12:33] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105) (owner: 10Kaldari) [19:13:36] MaxSem: any chance you could take responsibility for https://phabricator.wikimedia.org/T157708? I'm having trouble getting anyone to own up to knowing about that instance. 
[19:13:56] (03Merged) 10jenkins-bot: Enable Echo on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105) (owner: 10Kaldari) [19:14:06] (03CR) 10jenkins-bot: Enable Echo on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105) (owner: 10Kaldari) [19:15:01] kaldari: your change is live on mwdebug1002, check please [19:15:02] andrewbogott, I have only basic knowledge about the stack it uses - worse, I know only how to brick instances with dist-upgrades. so, I can try but satisfaction is not guaranteed:P [19:15:58] (this is a purely volunteer-run project, none of our team participated) [19:16:13] MaxSem: I don't mind giving a dist-upgrade a try. But replacing it with a modern build would be much better... [19:16:21] do you have any idea who is currently involved? If anyone? [19:16:44] aude, did you participate? ^ [19:19:39] thcipriani: checking... [19:20:14] thcipriani: hmm, seems to break things :( [19:20:47] unfortunately, it would give me a stacktrace though [19:20:53] would=won't [19:21:01] _joe_: I'm back. Please let me know when you're free [19:21:16] just for a consult [19:22:11] thcipriani: hmm, well now it's working fine [19:22:20] kaldari: hrm. I see an handful of things on logstash. Nothing very helpful. 
https://logstash.wikimedia.org/goto/fda3af321c7654e911acbf92ef739bb9 [19:22:28] !log running namespaceDupes.php on fiwiki (T103786) [19:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:33] T103786: Investigate installing php5.3 on trusty and/or debian instance - https://phabricator.wikimedia.org/T103786 [19:22:56] resourceloader cache misses, but that's about it, really [19:23:09] (03PS2) 10Nuria: [WIP] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [19:24:09] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [19:24:14] thcipriani: This is useful: "Table 'loginwiki.echo_target_page' doesn't exist...", although I was told that all the needed tables already exist... lemme check... [19:24:20] argh wrong task [19:25:44] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#2739193 (10Gehel) 2 observations after doing a short dive into GC logs: * it looks like most of the time (looking at logs over the past several days) the heap after GC is be... [19:26:07] thcipriani: I'm going to revert for now and come back to this later. Looks like there may still be some back-end requirements missing. [19:26:20] kaldari: ok, sounds good. 
[19:26:28] (03PS1) 10Kaldari: Revert "Enable Echo on loginwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337437 [19:26:42] thcipriani: ^ [19:26:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337437 (owner: 10Kaldari) [19:28:31] (03Merged) 10jenkins-bot: Revert "Enable Echo on loginwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337437 (owner: 10Kaldari) [19:28:40] (03CR) 10jenkins-bot: Revert "Enable Echo on loginwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337437 (owner: 10Kaldari) [19:29:35] kaldari: ok, mwdebug1002 should be back where we started. Thank you for checking your changes :) [19:30:10] thcipriani: Yep, back to normal [19:36:27] (03PS3) 10Dzahn: switch apt.wm.org from carbon to install1002 [dns] - 10https://gerrit.wikimedia.org/r/337196 (https://phabricator.wikimedia.org/T132757) [19:37:02] (03PS1) 10RobH: new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 [19:38:04] !log carbon - synced /srv/ data to install1002/2002 for the last time, switching apt.wikimedia.org CNAME to install1002 - carbon deprecated (T132757) [19:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:09] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757 [19:38:30] (03CR) 10jerkins-bot: [V: 04-1] new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 (owner: 10RobH) [19:38:36] (03CR) 10Dzahn: [C: 032] switch apt.wm.org from carbon to install1002 [dns] - 10https://gerrit.wikimedia.org/r/337196 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [19:39:17] opps, left out a comma. 
[19:39:45] (03PS2) 10RobH: new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 [19:40:35] (03CR) 10jerkins-bot: [V: 04-1] new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 (owner: 10RobH) [19:40:50] Hi all (again!) - does anyone have five minutes to step me through accessing some EventLogging on stat1003? I'm `statistics-users` but I have a sinking feeling that I need access to `research-client.cnf` to get at the data I need [19:41:15] samtar: try asking in #wikimedia-analytics? [19:41:32] legoktm: will do :) thanks for the nudge [19:42:10] (03PS3) 10RobH: new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 [19:42:17] and my editor added the comma, saved, and then said disk changes present and undid my comma... damn you bbedit for being odd. [19:42:42] robh: never thanked you for all the help in getting me set up, so thank you :) [19:42:49] welcome =] [19:46:53] (03PS2) 10ArielGlenn: little tool that displays the last page id in bz2 xml content file [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/337341 [19:47:01] robh: actually whilst you're semi-here - if I need my access group changed (pending a quick chat with -analytics) what's the process? [19:48:21] as in the one you were assigned is wrong and needs changing? [19:48:36] likely can just reopen the existing task since it's recent [19:48:44] asking for it to be changed and why [19:48:58] (if it was more than a few days old I'd suggest a new task but it's a recent task ;) [19:49:17] Okay, entirely my fault as I'm a little new to all this, need access to a .cnf [19:49:46] i think that's analytics-privatedata-users [19:49:52] Access to stat1002 and stat1004 to connect to the Analytics/Cluster (Hadoop/Hive) and to query private data hosted there, including webrequest logs.
Access to MariaDB slaves in /etc/mysql/conf.d/analytics-research-client.cnf [19:50:20] or Access to stat1002, public data like sampled webrequest logs (stored under /a/log/webrequest/archive) and for the MariaDB slaves in /etc/mysql/conf.d/statistics-private-client.cnf [19:50:24] statistics-privatedata-users [19:50:30] so not sure which one [19:50:50] samtar: if you get it figured out which one you need, then yeah just reopen the access request https://phabricator.wikimedia.org/T157483 with the correction [19:50:59] I think it's `researchers` actually, only need access to stat1003 and the research-client.cnf [19:51:08] but I'll confirm with analytics [19:51:37] cool, just as easy [19:51:40] once we know we can correct [19:51:52] just update the task so we have a record and lemme know =] [19:52:16] I think it's pretty clear this isn't you trying to escalate your rights bypassing a wait period or anything [19:52:41] if there was doubt, the worst case would be another 3 day wait, but i don't think that should apply here.
[19:59:48] (03PS1) 10Jdlrobson: Add Hindi Wikipedia wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337441 [19:59:50] (03PS1) 10Jdlrobson: Update Hebrew wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337442 (https://phabricator.wikimedia.org/T157863) [20:00:35] (03PS2) 10Jdlrobson: Disable Hungarian Popups A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337064 (https://phabricator.wikimedia.org/T156290) [20:05:06] (03PS1) 10Dzahn: install: enable Letsencrypt on install1002 [puppet] - 10https://gerrit.wikimedia.org/r/337443 (https://phabricator.wikimedia.org/T132757) [20:05:52] (03CR) 10Dzahn: [C: 032] "we don't need to care about disabling on carbon, puppet already stopped" [puppet] - 10https://gerrit.wikimedia.org/r/337443 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [20:09:15] (03PS2) 10Dzahn: install: enable Letsencrypt on install1002 [puppet] - 10https://gerrit.wikimedia.org/r/337443 (https://phabricator.wikimedia.org/T132757) [20:09:27] mutante: robh: thank you for the scandium phase out !!! :] [20:09:35] (03PS3) 10Dzahn: install: enable Letsencrypt on install1002 [puppet] - 10https://gerrit.wikimedia.org/r/337443 (https://phabricator.wikimedia.org/T132757) [20:09:41] (03CR) 10Dzahn: [V: 032 C: 032] install: enable Letsencrypt on install1002 [puppet] - 10https://gerrit.wikimedia.org/r/337443 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [20:09:47] I specially love the huge list of check boxes being striked in a single go https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-hcc6bkfhechhpzg/ : ] [20:10:18] it is out of warranty / too old isn't it? 
[20:10:50] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 10hardware-requests: Phase out scandium.eqiad.wmnet - https://phabricator.wikimedia.org/T150936#3023070 (10hashar) [20:11:19] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3023087 (10Samtar) 05Resolved>03Open [**fingers crossed this is the last time I reopen this!**] Completely my fault, I didn't realise I would need access to... [20:12:29] robh: should I assign it to you? ^ apologies for the second reopen D: [20:12:53] ill fix right now for ya [20:12:59] no worries [20:14:23] These access groups, SSH bastioning and all that are really great ideas but my goodness does it take a while to get your head around it o.O [20:15:48] The analytics groups are known to be more confusing than the others [20:15:55] (03PS1) 10RobH: correct samtar's stat1003 access [puppet] - 10https://gerrit.wikimedia.org/r/337445 [20:15:57] its why they have to have their own sub page explaining them, heh [20:18:13] (03CR) 10RobH: [C: 032] correct samtar's stat1003 access [puppet] - 10https://gerrit.wikimedia.org/r/337445 (owner: 10RobH) [20:18:46] samtar: just waiting for ci testing to catch up and i'll merge it live and manually run on stat1003 to update [20:18:57] robh: you're a star thank you :) [20:20:37] hashar: you're welcome. :) the "is it too old" question i refer to Robh [20:21:05] its very old [20:21:19] so decom is the default for it? [20:21:34] (not sure about the question though?) [20:21:41] too old for what? [20:22:19] if there is a proposed use for it after reclaim? 
[20:25:46] (03CR) 10Dzahn: "SSL cert fixed with https://gerrit.wikimedia.org/r/337443" [dns] - 10https://gerrit.wikimedia.org/r/337196 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [20:28:57] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3023122 (10Gehel) Next step of investigation might be to collect regular thread dumps and see if we can see something strange during the next issue. [20:33:56] !log dropped old out of date echo tables from extension1.loginwiki T157105 [20:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:50] robh: can I give it a whirl yet? [20:35:06] sorry, got distracted on a call [20:35:15] and missed my merge window, rebasing due to ff only in repo [20:35:16] =P [20:35:22] (03PS2) 10RobH: correct samtar's stat1003 access [puppet] - 10https://gerrit.wikimedia.org/r/337445 [20:35:43] it's the only thing in the zuul queue right now, so should be just a moment [20:36:00] * samtar nods like he understands what that even means [20:36:12] basically each patchset has to go through a suite of tests [20:36:23] i got distracted since the queue was backed up a few minutes ago and i waited too long [20:36:32] and someone else merged something so i have to rebase and merge =] [20:36:57] merging live and running puppet on stat1003 now [20:37:32] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:38:21] that doesn't look good!
:P [20:38:53] rip [20:40:54] (PS2) Andrew Bogott: WIP Apt: Remove an ensure->absent stanza [puppet] - https://gerrit.wikimedia.org/r/336954 [20:40:56] (PS1) Andrew Bogott: Remove openstack::clientlib from icinga hosts [puppet] - https://gerrit.wikimedia.org/r/337449 (https://phabricator.wikimedia.org/T157760) [20:41:43] (PS1) Kaldari: Enable Echo on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/337450 (https://phabricator.wikimedia.org/T157105) [20:44:48] (CR) Andrew Bogott: [C: -2] "These need to be modified to run on a host that already installs OS clients. Most likely labcontrol." [puppet] - https://gerrit.wikimedia.org/r/334658 (owner: Andrew Bogott) [20:44:55] (CR) Andrew Bogott: [C: 2] Remove openstack::clientlib from icinga hosts [puppet] - https://gerrit.wikimedia.org/r/337449 (https://phabricator.wikimedia.org/T157760) (owner: Andrew Bogott) [20:47:02] (PS2) Dzahn: let install1002 be the new source for APT data rsync [puppet] - https://gerrit.wikimedia.org/r/337198 (https://phabricator.wikimedia.org/T132757) [20:47:11] (PS3) Dzahn: let install1002 be the new source for APT data rsync [puppet] - https://gerrit.wikimedia.org/r/337198 (https://phabricator.wikimedia.org/T132757) [20:47:21] samtar: go for it now [20:47:38] you may have to log out of stat1003 and back onto it [20:47:57] but your change is now live on it [20:48:46] Operations, Ops-Access-Requests, Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3023135 (RobH) Open>Resolved a:RobH I've merged the requested change and puppet has run on stat1003. This should be working at this time.
[20:48:52] gotta run down the street for lunch, back in less than 10 [20:49:24] (CR) Dzahn: [C: 2] let install1002 be the new source for APT data rsync [puppet] - https://gerrit.wikimedia.org/r/337198 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn) [20:51:11] robh: `Could not open required defaults file: /etc/mysql/conf.d/research-client.cnf` [20:51:57] !log carbon/install - adjusted Letsencrypt cert creation, deactivated reprepro to protect from accidental use, switching rsync direction from install1002->install2002, disabled cron on carbon (T132757) [20:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:02] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757 [20:53:46] robh: ah relogged and it's all good, thank you! [21:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170213T2100). [21:00:37] no ores deploy today [21:01:16] cool [21:01:26] samtar: glad it worked =] [21:01:37] rob :) thank you again for all your help! [21:01:50] quite welcome, part of ops clinic duty (which i actually enjoy ;) [21:02:22] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to stat1003/analytics-store for mlitn - https://phabricator.wikimedia.org/T157812#3023154 (RobH) a:RobH [21:05:20] robh: good morning :] Speaking of clinic I have some old task for the datacenter smart hands :] [21:05:32] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:05:52] labnodepool1001 in eqiad received some SSD disk but we have no use for them, so I thought maybe we could reclaim the disk for the spare ( https://phabricator.wikimedia.org/T116936 ) [21:06:42] hashar: so the ssds arent mounted or anything?
[21:07:27] hashar: i only see dual SSD no sata in there at all? [21:09:50] Operations, ops-eqiad, DC-Ops: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#3023175 (RobH) So this system appears to ONLY have dual SSDs installed. robh@labnodepool1001:~$ sudo lshw -class disk *-disk:0 description: ATA Disk... [21:11:27] robh: ooops [21:11:38] heh, we could do it but you wont like the result! [21:11:47] let me rephrase that [21:11:52] the box solely has ssd ? :] [21:12:01] hashar: correct, and they are old ssds [21:12:09] so honestly we wouldnt use them elsewhere, they are an older series [21:12:15] somehow I thought we had a ssd for / [21:12:18] and another ssd for /srv or something [21:12:21] so not sure its worth reimaging this box to take back dual 160GB SSDs [21:12:31] if they are, they are disabled in bios and not showing [21:12:35] which would be odd. [21:12:46] (but there is no way to know without an onsite check or reboot into bios) [21:12:57] na na [21:13:04] or.. it has hdd and they are disabled [21:13:06] but yeah [21:13:11] I guess it has been installed with two 150 GB disk in raid [21:13:13] its odd to see any disks disabled in bios, let alone ssds [21:13:22] yep, dual 160GB intel M160SSDs [21:13:27] and I got confused thinking it had regular disks for / and ssd for /srv [21:13:32] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:15:19] Operations, ops-eqiad, DC-Ops: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#3023181 (hashar) Open>declined I guess when I have opened the task I thought the machine had regular disk for / and ssd for /srv as we did on gallium back in the old day. D... [21:15:22] robh: thank you Doctor System!
[21:15:29] one less task :] [21:15:34] I have declined it [21:15:37] successful ops clinic duty [21:37:06] (CR) Raimond Spekking: [C: -1] "This is not necessary with adding $namespaceAliases as suggested in https://gerrit.wikimedia.org/r/#/c/336986/3" [mediawiki-config] - https://gerrit.wikimedia.org/r/337192 (https://phabricator.wikimedia.org/T157846) (owner: MarcoAurelio) [21:37:58] if you merge something in mediawiki/extensions/Collection is there any other step after "merge in gerrit" that needs to happen? [21:38:38] @cscott [21:38:45] @seen cscott [21:38:46] mutante: Last time I saw cscott they were changing the nickname to , but is no longer in channel #wikimedia-collaboration at 1/28/2017 6:41:12 AM (16d14h57m33s ago) [21:39:44] !log bsitzmann@tin Started deploy [mobileapps/deploy@f6b4435]: Update mobileapps to 3af473f [21:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:25] (CR) MarcoAurelio: "> This is not necessary with adding $namespaceAliases as suggested in" [mediawiki-config] - https://gerrit.wikimedia.org/r/337192 (https://phabricator.wikimedia.org/T157846) (owner: MarcoAurelio) [21:42:32] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [21:43:28] !log bsitzmann@tin Finished deploy [mobileapps/deploy@f6b4435]: Update mobileapps to 3af473f (duration: 03m 44s) [21:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:28] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to stat1003/analytics-store for mlitn - https://phabricator.wikimedia.org/T157812#3023272 (RobH) p:Triage>Normal [21:46:13] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:48:38] Operations, Traffic, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3023275 (Ijon) Agree re lowering to High. But it does seem like a bug and not etwiki's using inappropriate magic words. [21:50:02] (CR) Dereckson: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/337442 (https://phabricator.wikimedia.org/T157863) (owner: Jdlrobson) [21:52:10] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#3023288 (Dzahn) [21:52:23] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2853059 (Dzahn) [21:53:29] Operations: Setting up a mirror serv{er,ice} - https://phabricator.wikimedia.org/T84817#3023317 (Dzahn) @faidon Is this ticket resolved since we have sodium assigned as mirrors.wm.org or should carbon still become a mirror server now that it's not an install server anymore? [21:54:33] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 65431 MB (12% inode=99%) [21:55:30] (CR) Kaldari: [C: 2] Enable Echo on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/337450 (https://phabricator.wikimedia.org/T157105) (owner: Kaldari) [21:55:41] Operations, Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#3023332 (Dzahn) carbon has been replaced with install1002 and install2002 (and sodium). carbon itself will be decom'ed since it's out of warranty.
[21:56:07] \o/ [21:57:25] (Merged) jenkins-bot: Enable Echo on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/337450 (https://phabricator.wikimedia.org/T157105) (owner: Kaldari) [21:57:41] (CR) jenkins-bot: Enable Echo on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/337450 (https://phabricator.wikimedia.org/T157105) (owner: Kaldari) [21:59:32] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 65238 MB (12% inode=99%) [21:59:54] (Abandoned) MarcoAurelio: Adding "Categoria:" as namespace alias for ext.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/337192 (https://phabricator.wikimedia.org/T157846) (owner: MarcoAurelio) [22:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170213T2200). Please do the needful. [22:00:04] kaldari: A patch you scheduled for Weekly Security deployment window is about to be deployed. Please be available during the process. [22:00:31] Reedy: I just merged the config change on tin (but haven't synced). [22:00:50] ok, so.. [22:00:53] Reedy: So hopefully you should be able to run the script [22:01:17] reedy@tin:/srv/mediawiki-staging/php-1.29.0-wmf.11$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=loginwiki Echo [22:01:17] Using special database connection for EchoCreating Echo tables...done! [22:01:25] | echo_email_batch | [22:01:25] | echo_event | [22:01:25] | echo_notification | [22:01:25] | echo_target_page | [22:03:34] kaldari: scap pull'd it onto mwdebug1002 [22:03:46] Reedy: OK, I'll test there.... [22:03:52] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [22:04:55] hmm, not appearing in special:version [22:05:09] * Reedy tries clicking on [22:05:16] Now it is [22:05:34] Reedy: Seems to be working fine, no errors [22:06:48] kaldari: Ok, we're getting a weird race condition I've seen before [22:07:00] Had it when creating new wikis [22:07:10] It tells me I have one notification on the icon [22:07:15] But "You have no notifications." [22:07:20] RoanKattouw: Do you remember this before? [22:07:38] Hmm there can be a few different causes [22:07:46] Does it happen on only one wiki or on all wikis? [22:07:58] And what is the wiki + user name? [22:08:12] Special:Notifications on loginwiki is very different to Special:Notifications on enwiki [22:08:48] I just get "No such special page" [22:08:51] Operations, Traffic, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#1824561 (Tgr) `#time` and co. are used on many pages and usually they do not require cache invalidation. For example [[https://en.wikipedia.org/wiki/Temp... [22:08:57] RoanKattouw: mwdebug1002 [22:09:15] Extension is in the process of being enabled [22:09:21] Tables just recreated [22:09:44] https://usercontent.irccloud-cdn.com/file/SzxyIQbp/loginwiki-notifs.png [22:09:46] (PS1) Eevans: Enable Prometheus exporter on restbase1007 (canary) [puppet] - https://gerrit.wikimedia.org/r/337493 (https://phabricator.wikimedia.org/T155120) [22:09:47] Looks fine to me [22:10:06] Ah it is for me too [22:10:24] I guess something caught up [22:10:24] That's expected to happen because I have never received any notifs on that wiki (obviously) [22:10:30] Oh was it different before? [22:10:46] RoanKattouw: Yeah, I literally had the "You have no notifications."
text [22:10:53] None of the fancy js stuff [22:11:00] Oh, yeah, I saw that at first too but then JS loaded and replaced it [22:11:14] So possibly some caching issue [22:11:20] Yeah perhaps [22:11:24] kaldari: I guess we're good to push it out [22:11:33] Annoying rollout niggles, not anything actually broken [22:11:34] Operations, Commons, MediaWiki-File-management, Multimedia, MW-1.28-release-notes: File does not thumbnail, doesn't have extracted metadata, has reported zero width/height (due to garbage bytes between JPEG sections) - https://phabricator.wikimedia.org/T148606#3023443 (matmarex) Open>R... [22:13:10] Reedy: OK, should I sync it now? [22:13:17] sure [22:14:02] ! log scap sync-dir dblists/ 'Turning on Echo for loginwiki' [22:14:05] (PS1) Dzahn: aptrepo: rsync the entire /srv/ automatically, not just /srv/wikimedia/ [puppet] - https://gerrit.wikimedia.org/r/337498 [22:14:07] !log scap sync-dir dblists/ 'Turning on Echo for loginwiki' [22:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:12] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [22:14:33] !log kaldari@tin Synchronized dblists/: Turning on Echo for loginwiki (duration: 00m 41s) [22:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:41] all done [22:14:48] checking live site... [22:15:15] looks good! [22:16:23] Reedy: Thanks for the help!!
[22:16:34] No problem :) [22:18:22] Operations, Release-Engineering-Team, Wikipedia-Android-App-Backlog, Jenkins, and 2 others: Investigate how to improve CI performance and stability - https://phabricator.wikimedia.org/T158014#3023491 (Fjalapeno) [22:19:03] Operations, Release-Engineering-Team, Wikipedia-Android-App-Backlog, Jenkins, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3023493 (greg) [22:20:31] (CR) Dzahn: ""precise-compat branch should be checked out if this is a precise installation" can we remove this or is precise still used with labs_vag" [puppet] - https://gerrit.wikimedia.org/r/337205 (owner: Dzahn) [22:20:32] RECOVERY - Disk space on elastic1018 is OK: DISK OK [22:21:08] (CR) Dzahn: [C: -1] "needs to stay until the very last precise host .. or ... ?" [puppet] - https://gerrit.wikimedia.org/r/337204 (owner: Dzahn) [22:22:59] Operations, Release-Engineering-Team, HHVM, Wikimedia-Incident: 2016-10-17 API cluster overload - https://phabricator.wikimedia.org/T148652#3023511 (greg) So, this never happened (this == the investigation). It is now almost 4 months later, with a new year in between. There was no incident report... [22:26:20] !log update RESTBase to 0e9106ab8 - staging [22:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:22] PROBLEM - restbase endpoints health on xenon is CRITICAL: /feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 responds with malformed body: list index out of range [22:30:15] Pchelolo: ^ is that related to your deploy? [22:31:00] Reedy: yep.
All is under control [22:31:31] It's in staging environment, something is a bit off with a checker script we're using there [22:32:58] (CR) BryanDavis: "> "precise-compat branch should be checked out if this is a precise" [puppet] - https://gerrit.wikimedia.org/r/337205 (owner: Dzahn) [22:34:04] (CR) Dzahn: "ok. yes, not having watroles is a major loss" [puppet] - https://gerrit.wikimedia.org/r/337205 (owner: Dzahn) [22:37:11] mutante: we can certainly kill it after the non-CI precise instances in Labs are gone. That whole role is slated to die too (T121477) [22:37:12] T121477: Migrate projects using ::role::deprecated::labsvagrant to ::role::labs::mediawiki_vagrant - https://phabricator.wikimedia.org/T121477 [22:39:00] !log rollback RESTBase to ea980cc5d - staging [22:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:23] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [22:45:02] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [22:45:12] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [22:45:13] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [22:46:15] ^ sorry about the puppetsplosion, fixing [22:47:53] Operations, hardware-requests: spare ex4200s - check on quantity for potential shipment to OIT - https://phabricator.wikimedia.org/T157839#3023560 (RobH) p:Triage>Normal [22:50:12] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [22:51:47] Operations, Continuous-Integration-Config, Release-Engineering-Team, Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3023564 (hashar) The history of recents builds on https://integration.wikim...
[22:55:12] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 283 seconds ago with 0 failures [23:00:12] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 298 seconds ago with 0 failures [23:11:42] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:24:35] !log bsitzmann@tin Started deploy [mobileapps/deploy@cd3b897]: Update mobileapps to 776211b [23:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:54] !log bsitzmann@tin Finished deploy [mobileapps/deploy@cd3b897]: Update mobileapps to 776211b (duration: 03m 19s) [23:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:12] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:29:12] !log update RESTBase to 0e9106ab8 - staging [23:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:30] (PS3) Dzahn: prometheus: add v6 reverse records [dns] - https://gerrit.wikimedia.org/r/337422 (https://phabricator.wikimedia.org/T154504) [23:35:37] !log update RESTBase to 0e9106ab8 - canary on restbase1007 [23:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:22] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [23:38:52] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused [23:39:30] ^^^ that's ok, it's creating keyspaces and it's depooled [23:39:42] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last
run 10 seconds ago with 0 failures [23:42:56] (PS1) Dereckson: Clean Wikisource namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/337520 (https://phabricator.wikimedia.org/T46320) [23:42:59] Operations: Setting up a mirror serv{er,ice} - https://phabricator.wikimedia.org/T84817#3023668 (faidon) Open>Resolved a:faidon Nope, done for a while now :) [23:44:32] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [23:44:52] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.025 second response time [23:45:01] !log update RESTBase to 0e9106ab8 [23:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:06] Operations, Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#3023684 (Dzahn) a:Dzahn [23:48:21] Operations, Continuous-Integration-Config, Release-Engineering-Team, Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3023688 (Niedzielski) FWIW, here's the corresponding device log for [[ http... [23:48:48] (CR) Gergő Tisza: ores: Send logs to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/321096 (https://phabricator.wikimedia.org/T149010) (owner: Ladsgroup) [23:48:53] Operations, Deployment-Systems, Stashbot: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791#480032 (demon) >>! In T46791#480052, @greg wrote: > There's two separate use cases here: > > 1) Someone in -operations !log'ing som...
[23:55:16] Operations, Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#3023703 (Dzahn) [23:55:18] Operations, Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#3023701 (Dzahn) Open>Resolved I'll close this as ..eh.. something between resolved and rejected :) per the comment above. I will make a subtask for decom'ing it. [23:56:12] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:56:52] Operations, Deployment-Systems, Stashbot: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791#3023711 (bd808) Some related thoughts/explanations on {T156079}. wm-bot already does too many things IMO, and Stashbot currently ow...