[00:37:43] 10Operations, 10Commons, 10Multimedia, 10Traffic, and 2 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3361808 (10Poyekhali) >>! In T167400#3361750, @Platonides wrote: >>>! In T167400#3356482, @Poyekhali wrote: >> Unless Commons have a lot... [01:21:30] (03CR) 10jenkins-bot: Add atjwiki to securepollglobal.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359964 (owner: 10Reedy) [01:22:19] (03CR) 10jenkins-bot: Add sandbox link for dtywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359387 (https://phabricator.wikimedia.org/T168038) (owner: 10DatGuy) [01:23:09] (03CR) 10jenkins-bot: Create logo for the Kabiye Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344562 (https://phabricator.wikimedia.org/T160868) (owner: 10Odder) [01:23:11] (03CR) 10jenkins-bot: Update instances of Wikimedia Foundation logo #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [01:23:13] (03CR) 10jenkins-bot: Change AbuseFilter block duration for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358156 (https://phabricator.wikimedia.org/T167562) (owner: 10Huji) [01:23:15] (03CR) 10jenkins-bot: Use directly wgGalleryOptions without wmg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331819 (owner: 10Dereckson) [01:23:17] (03CR) 10jenkins-bot: Upload logos for the Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358883 (owner: 10Odder) [01:23:19] (03CR) 10jenkins-bot: Enable OOjs UI buttons on EditPage for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359514 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [01:23:21] (03CR) 10jenkins-bot: [cleanup] remove old interwiki search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357642 (owner: 10DCausse) [01:23:23] (03CR) 10jenkins-bot: Add “Constraints” section on 
test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359135 (https://phabricator.wikimedia.org/T167126) (owner: 10Lucas Werkmeister (WMDE)) [01:23:25] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359941 (owner: 10Marostegui) [01:23:27] (03CR) 10jenkins-bot: Add atj to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359810 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [01:23:29] (03CR) 10jenkins-bot: Remove $wgEnableValidationStatisticsUpdates from FlaggedRevs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354600 (owner: 10Nemo bis) [01:38:17] o_0 [01:46:30] poor Reedy [01:59:25] (03CR) 10Ottomata: "Volans! You are amazing, thank you! I will def respond to all of these. Note this is so WIP still, tests to come. Tons of TODOs for me" [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [03:25:49] A Troublesome Encounter! [03:25:50] Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL). [03:25:52] on Phabricator [03:25:59] back now... [03:26:41] still getting it :/ [03:39:50] 10Operations, 10DBA, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3361933 (10MZMcBride) What's the status of this task? The previous comment is from over a year ago. [03:40:27] 10Operations, 10Traffic, 10Wikimedia-Blog, 10HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3268418 (10Tbayer) Cool - I assume this has had enough eyes; merging and submitting now. (BTW, for later internal reference, that was [[https://wordpressvip.zendesk.com/h... 
[04:27:03] (03PS1) 10Phuedx: relatedArticles: SamplingRate -> BucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360166 (https://phabricator.wikimedia.org/T167236) [05:29:41] 10Operations, 10DBA, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#1542524 (10Marostegui) This is an Epic task and quite hard to achieve in short or even medium term. To give you an example, row based replication is quite strict with data drifts and can break... [06:10:43] (03CR) 10Giuseppe Lavagetto: [C: 031] "Please note that this will temporarily remove thumbor from codfw low-traffic lvs servers, that is until thumbor doesn't in fact get config" [puppet] - 10https://gerrit.wikimedia.org/r/357863 (owner: 10Giuseppe Lavagetto) [06:12:58] (03PS11) 10Giuseppe Lavagetto: role::lvs::balancer: refactor to role/profile (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/357863 [06:28:31] (03CR) 10Giuseppe Lavagetto: [C: 032] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/357863 (owner: 10Giuseppe Lavagetto) [06:33:17] 10Operations, 10Traffic, 10Wikimedia-Blog, 10HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3362021 (10Tbayer) This has been deployed. 
Per a quick look, the shortlink in Volker's example above has been fixed (` 10Operations, 10DBA: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3362038 (10jcrespo) [06:49:24] 10Operations, 10Community-Wikimetrics, 10DBA, 10Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#3362051 (10jcrespo) [06:49:29] 10Operations, 10DBA, 10Patch-For-Review: Adapt wmf-mariadb101 package for stretch and adapt its service to systemd - https://phabricator.wikimedia.org/T116903#3362052 (10jcrespo) [06:49:32] 10Operations, 10DBA, 10Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501#3362053 (10jcrespo) [06:49:35] 10Operations, 10DBA: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3362050 (10jcrespo) [06:55:16] (03PS1) 10Jcrespo: install_server: reimage db2072 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/360167 (https://phabricator.wikimedia.org/T168356) [06:55:55] (03PS2) 10Jcrespo: install_server: reimage db2072 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/360167 (https://phabricator.wikimedia.org/T168356) [06:57:10] !log install remaining exim security updates [06:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:25] (03CR) 10ArielGlenn: [C: 032] script to batch 7z recompress revision content history files manually [dumps] - 10https://gerrit.wikimedia.org/r/359907 (https://phabricator.wikimedia.org/T168223) (owner: 10ArielGlenn) [07:01:09] (03CR) 10Zhuyifei1999: "Bump. 
/me wants to use fonts" [puppet] - 10https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: 10Zhuyifei1999) [07:02:16] (03PS4) 10Giuseppe Lavagetto: role::lvs::balancer: also manage interface tagging [puppet] - 10https://gerrit.wikimedia.org/r/358027 [07:04:39] (03CR) 10ArielGlenn: [C: 032] script for retrieving raw flow revision content [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/357873 (owner: 10ArielGlenn) [07:05:50] (03PS1) 10Jcrespo: mariadb: move db2072 to s1 shard (enwiki) [puppet] - 10https://gerrit.wikimedia.org/r/360172 (https://phabricator.wikimedia.org/T168356) [07:06:11] jouncebot: next [07:06:11] In 0 hour(s) and 53 minute(s): Thumbor (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T0800) [07:09:43] (03PS5) 10Giuseppe Lavagetto: role::lvs::balancer: also manage interface tagging [puppet] - 10https://gerrit.wikimedia.org/r/358027 [07:10:45] !log Deploy alter table s5 - db1095 - T166207 [07:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:55] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [07:13:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360214 (https://phabricator.wikimedia.org/T166207) [07:14:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360214 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [07:16:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360214 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [07:16:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360214 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [07:18:34] (03CR) 10Marostegui: [C: 031] install_server: reimage 
db2072 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/360167 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [07:20:49] !log Deploy alter table s5 - db1071 - T166207 [07:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:59] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [07:22:12] !log Stop MySQL dbstore2001 for maintenance - T168354 [07:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:22] T168354: dbstore2001 s5 thread is 6 days delayed - https://phabricator.wikimedia.org/T168354 [07:23:40] !log installing glibc security updates [07:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:28] (03CR) 10Marostegui: [C: 031] mariadb: move db2072 to s1 shard (enwiki) [puppet] - 10https://gerrit.wikimedia.org/r/360172 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [07:27:22] !log kill alter table on enwiki.revision db1047 after running for 13 days - T166452 [07:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:31] T166452: db1047 has been restarted - needs another restart - https://phabricator.wikimedia.org/T166452 [07:28:12] * elukey is sad for db1047 [07:29:26] 10Operations: logmsgbot needs restarting - https://phabricator.wikimedia.org/T168348#3362162 (10Paladox) Ah https://github.com/wikimedia/puppet/blob/4412d294bcb95bb9ac0b951de59576d13a39b0e0/modules/tcpircbot/manifests/instance.pp#L36 I am unsure how that Param ever worked as I doint see ssl existing in the old... [07:29:27] elukey: XDDDDDDDDD [07:31:17] 10Operations: logmsgbot needs restarting - https://phabricator.wikimedia.org/T168348#3362170 (10Paladox) To use ssl in the new version of python-irc you do https://github.com/jaraco/irc/search?utf8=✓&q=ssl&type= [07:33:51] did 47 crash? 
[07:34:37] jynus: I am restarting it [07:35:46] !log restarting elastic1017 to validate upgrades [07:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:49] 10Operations, 10Labs, 10Labs-Infrastructure: Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3356455 (10akosiaris) It's not related to the host. It's the Puppet CA itself as @Andrew says. On a random VM created on Mar 26 ``` sudo openssl x509 -noout -... [07:45:20] !log Drop table titlekey from s5 - T164949 [07:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:30] T164949: Drop titlekey table from all wmf databases - https://phabricator.wikimedia.org/T164949 [07:51:43] (03PS6) 10Giuseppe Lavagetto: role::lvs::balancer: also manage interface tagging [puppet] - 10https://gerrit.wikimedia.org/r/358027 [07:59:39] jouncebot: next [07:59:40] In 0 hour(s) and 0 minute(s): Thumbor (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T0800) [08:00:04] godog and gilles: Dear anthropoid, the time has come. Please deploy Thumbor (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T0800). 
[08:00:13] (03PS3) 10Filippo Giunchedi: Deploy Thumbor to all Wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/359931 (https://phabricator.wikimedia.org/T167794) (owner: 10Gilles) [08:00:20] gilles: ^ [08:02:57] (03CR) 10Filippo Giunchedi: [C: 032] Deploy Thumbor to all Wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/359931 (https://phabricator.wikimedia.org/T167794) (owner: 10Gilles) [08:03:10] !log Drop table titlekey from s7 - https://phabricator.wikimedia.org/T164949 [08:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:12] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): logstash mapping mixing up field types - https://phabricator.wikimedia.org/T165137#3362234 (10Gehel) I checked a few servers and I can't see the issue anymore. [08:15:27] !log Drop table titlekey from s4 - T164949 [08:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:37] T164949: Drop titlekey table from all wmf databases - https://phabricator.wikimedia.org/T164949 [08:22:49] !log Drop table titlekey from s3 - T164949 [08:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:59] T164949: Drop titlekey table from all wmf databases - https://phabricator.wikimedia.org/T164949 [08:24:45] 10Operations: logmsgbot needs restarting - https://phabricator.wikimedia.org/T168348#3362267 (10Paladox) @akosiaris I know how to fix this but gerrit we ui has started throwing 500 for me when I am creating the change (I am out and not near a pc) [08:25:21] <_joe_> !log manually patching gerrit's systemd unit file to allow more open files [08:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:27] 10Operations, 10Gerrit: Gerrit is now constently throwing 500 for me when reviewing patches - https://phabricator.wikimedia.org/T168360#3362285 (10Paladox) [08:28:47] 10Operations, 10Gerrit: Gerrit is now constently throwing 500 for me 
when reviewing patches - https://phabricator.wikimedia.org/T168360#3362299 (10Paladox) p:05Triage>03Unbreak! Setting unbreak as I carnt seem to do anything now. [08:28:49] <_joe_> paladox: that's what I am fixing now [08:29:04] <_joe_> paladox: and it's the fault of the conversion to systemd [08:29:26] 10Operations, 10Gerrit: Gerrit is now constently throwing 500 for me when reviewing patches - https://phabricator.wikimedia.org/T168360#3362303 (10Joe) a:03Joe [08:30:13] 10Operations, 10Gerrit: Gerrit is now constently throwing 500 for me when reviewing patches - https://phabricator.wikimedia.org/T168360#3362285 (10Joe) ``` Caused by: java.nio.file.FileSystemException: .../_1ut5m.nvm: Too many open files ``` I'm fixing by hand the systemd unit and restarting gerrit. [08:30:19] <_joe_> !log restarting gerrit T168360 [08:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:30] T168360: Gerrit is now constently throwing 500 for me when reviewing patches - https://phabricator.wikimedia.org/T168360 [08:30:59] <_joe_> I hope it will restart [08:31:40] _joe_: works for me now :-) [08:31:54] <_joe_> marostegui: the question is not "works for me now" [08:32:11] <_joe_> the question is "how many things were not checked when converting to systemd [08:33:14] yeah, that for sure [08:35:20] 10Operations, 10Gerrit: Gerrit is now constently throwing 500 for me when reviewing patches - https://phabricator.wikimedia.org/T168360#3362314 (10Joe) So what I did: - raised the LimitNOFile to 60000 manually - didn't bother with all the other ulimits that the shell script tries to set - restarted gerrit -di... 
[08:35:45] 10Operations, 10Gerrit: Gerrit is now constently throwing 500 for me when reviewing patches - https://phabricator.wikimedia.org/T168360#3362315 (10Joe) p:05Unbreak!>03High a:05Joe>03None [08:35:48] !log roll restart swift-proxy on ms-fe* to pick up thumbor changes [08:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:21] gilles: ^ [08:38:47] 10Operations, 10Traffic, 10Wikimedia-Blog, 10HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3362318 (10Volker_E) That's what I expected. The shortlink didn't seem to be reason for the error. As I've said, I didn't have access to the error log. Even with it abov... [08:39:19] (03PS1) 10Filippo Giunchedi: hieradata: have thumbor.svc alert critical [puppet] - 10https://gerrit.wikimedia.org/r/360307 (https://phabricator.wikimedia.org/T121388) [08:44:47] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: have thumbor.svc alert critical [puppet] - 10https://gerrit.wikimedia.org/r/360307 (https://phabricator.wikimedia.org/T121388) (owner: 10Filippo Giunchedi) [08:46:04] !log Drop table titlekey from s1 - T164949 [08:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:14] T164949: Drop titlekey table from all wmf databases - https://phabricator.wikimedia.org/T164949 [09:03:24] 10Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#2036092 (10Marostegui) Current status - it looks like it has been partially deleted (or was never placed) on some wikis: s1: ``` db1052.eqiad.wmnet -rw-rw---- 1 mysql mysql 11M Jan 14 2015 /srv... 
[09:05:35] (03PS10) 10Giuseppe Lavagetto: role::lvs::balancer: also manage interface tagging [puppet] - 10https://gerrit.wikimedia.org/r/358027 [09:15:29] (03PS11) 10Giuseppe Lavagetto: role::lvs::balancer: also manage interface tagging [puppet] - 10https://gerrit.wikimedia.org/r/358027 [09:21:10] 10Operations, 10Gerrit: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3362470 (10Aklapper) [09:27:54] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/6820/ shows no differences in the applied catalog happen." [puppet] - 10https://gerrit.wikimedia.org/r/358027 (owner: 10Giuseppe Lavagetto) [09:28:51] (03PS1) 10Gilles: Increase Swift timeout for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/360309 (https://phabricator.wikimedia.org/T121388) [09:29:51] !log Rename table on db1089 enwiki.wikilove_image_log - T127219 [09:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:01] T127219: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219 [09:30:42] (03PS12) 10Giuseppe Lavagetto: role::lvs::balancer: also manage interface tagging [puppet] - 10https://gerrit.wikimedia.org/r/358027 [09:32:34] (03CR) 10Giuseppe Lavagetto: [C: 032] role::lvs::balancer: also manage interface tagging [puppet] - 10https://gerrit.wikimedia.org/r/358027 (owner: 10Giuseppe Lavagetto) [09:33:58] (03CR) 10Jcrespo: [C: 032] install_server: reimage db2072 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/360167 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:34:02] (03PS3) 10Jcrespo: install_server: reimage db2072 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/360167 (https://phabricator.wikimedia.org/T168356) [09:35:05] 10Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3362495 (10Marostegui) I have taken a 
backup of this tables at: ``` dbstore1001:/srv/tmp/T127219 ``` It is tiny really: ``` root@dbstore1001:/srv/tmp/T127219# pwd /srv/tmp/T127219 root@dbstore... [09:35:14] (03PS1) 10Giuseppe Lavagetto: profile::lvs::tagged_interface: remove debug warnings [puppet] - 10https://gerrit.wikimedia.org/r/360310 [09:35:31] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::lvs::tagged_interface: remove debug warnings [puppet] - 10https://gerrit.wikimedia.org/r/360310 (owner: 10Giuseppe Lavagetto) [09:36:05] (03PS4) 10Jcrespo: install_server: reimage db2072 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/360167 (https://phabricator.wikimedia.org/T168356) [09:37:51] (03CR) 10Filippo Giunchedi: [C: 032] Increase Swift timeout for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/360309 (https://phabricator.wikimedia.org/T121388) (owner: 10Gilles) [09:37:56] (03PS2) 10Filippo Giunchedi: Increase Swift timeout for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/360309 (https://phabricator.wikimedia.org/T121388) (owner: 10Gilles) [09:38:09] hi [09:38:20] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Increase Swift timeout for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/360309 (https://phabricator.wikimedia.org/T121388) (owner: 10Gilles) [09:38:52] I am trying to rename https://commons.wikimedia.org/wiki/File:New_York_1911_cut.webm over https://commons.wikimedia.org/wiki/File:New_York_1911.webm [09:39:21] the target was deleted, but then I get "You do not have permission to move this page, for the following reason: [09:39:22] An unknown error occurred in storage backend "local-swift-eqiad"." [09:40:08] trying to delete the target separately doesn't work either [09:40:46] !log roll-restart thumbor to increase swift timeout [09:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:23] yannf: did the page report an id or sth like that for the exception? 
[09:41:26] (03PS5) 10Jcrespo: install_server: reimage db2072 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/360167 (https://phabricator.wikimedia.org/T168356) [09:41:41] godog, no [09:46:16] !log rebooting app server canaries for kernel update [09:46:18] yannf: ack, I can't find anything obvious now in logstash for those two files, could you open a task or maybe there are similar tasks already? [09:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:01] !log reset ms-be1014 idrac via ipmitool [09:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:08] ok done https://phabricator.wikimedia.org/T168374 [09:55:17] thanks! [09:55:22] (03CR) 10Giuseppe Lavagetto: [C: 031] wmflib: cleanup secret.rb a little bit [puppet] - 10https://gerrit.wikimedia.org/r/359449 (owner: 10Faidon Liambotis) [09:58:16] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3362598 (10akosiaris) The date is no longer tentative. It's now fixed. [09:58:53] 10Operations, 10Gerrit: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3362599 (10Paladox) Ah, systemd caused this? Oh I thought gerrit was started by the init script as we had problems with systemd before. Also it was... 
[10:00:42] !log reimage ms-be1016 with stretch [10:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:02] (03Draft1) 10Paladox: Fix systemd script to use a higher LimitNOFile value [debs/gerrit] - 10https://gerrit.wikimedia.org/r/360312 [10:01:29] (03PS2) 10Paladox: Fix systemd script to use a higher LimitNOFile value [debs/gerrit] - 10https://gerrit.wikimedia.org/r/360312 (https://phabricator.wikimedia.org/T168360) [10:03:33] !log reboot kafka1012, analytics1028, aqs1004 for kernel upgrades (canary hosts) [10:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:58] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Apart from my inline comments:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/359447 (owner: 10Faidon Liambotis) [10:04:13] (03CR) 10Giuseppe Lavagetto: [C: 031] Kill module puppet_statsd [puppet] - 10https://gerrit.wikimedia.org/r/359448 (owner: 10Faidon Liambotis) [10:06:45] (03CR) 10Giuseppe Lavagetto: [C: 031] "Change is correct, but I don't like this "fix" a bit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359450 (owner: 10Faidon Liambotis) [10:07:06] !log rebooting mwdebug servers for kernel update [10:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:13] (03PS2) 10Jcrespo: mariadb: move db2072 to s1 shard (enwiki) [puppet] - 10https://gerrit.wikimedia.org/r/360172 (https://phabricator.wikimedia.org/T168356) [10:11:42] (03CR) 10Jcrespo: [C: 032] mariadb: move db2072 to s1 shard (enwiki) [puppet] - 10https://gerrit.wikimedia.org/r/360172 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [10:13:18] (03PS1) 10Filippo Giunchedi: Repurpose mw123[67] as thumbor100[34] [dns] - 10https://gerrit.wikimedia.org/r/360316 (https://phabricator.wikimedia.org/T168297) [10:13:43] (03Draft1) 10Paladox: tcpircbot: fix bot to support the new way of doing ssl [puppet] - 10https://gerrit.wikimedia.org/r/360315 
(https://phabricator.wikimedia.org/T168348) [10:13:48] (03PS2) 10Paladox: tcpircbot: fix bot to support the new way of doing ssl [puppet] - 10https://gerrit.wikimedia.org/r/360315 (https://phabricator.wikimedia.org/T168348) [10:15:04] (03CR) 10jerkins-bot: [V: 04-1] tcpircbot: fix bot to support the new way of doing ssl [puppet] - 10https://gerrit.wikimedia.org/r/360315 (https://phabricator.wikimedia.org/T168348) (owner: 10Paladox) [10:16:06] (03PS3) 10Paladox: tcpircbot: fix bot to support the new way of doing ssl [puppet] - 10https://gerrit.wikimedia.org/r/360315 (https://phabricator.wikimedia.org/T168348) [10:17:21] (03PS1) 10Filippo Giunchedi: Repurpose mw123[67] as thumbor100[34] [puppet] - 10https://gerrit.wikimedia.org/r/360317 (https://phabricator.wikimedia.org/T168297) [10:18:56] _joe_: https://gerrit.wikimedia.org/r/#/c/360316/ and https://gerrit.wikimedia.org/r/#/c/360317/ should do it [10:20:56] (03PS1) 10Marostegui: s1.hosts: Add db2072 to s1 [software] - 10https://gerrit.wikimedia.org/r/360318 (https://phabricator.wikimedia.org/T168356) [10:23:07] (03CR) 10Paladox: "I think this will only work on python 3 but not sure" [puppet] - 10https://gerrit.wikimedia.org/r/360315 (https://phabricator.wikimedia.org/T168348) (owner: 10Paladox) [10:25:36] (03CR) 10Paladox: "https://stackoverflow.com/questions/22387651/wrap-socket-got-an-unexpected-keyword-argument-server-hostname" [puppet] - 10https://gerrit.wikimedia.org/r/360315 (https://phabricator.wikimedia.org/T168348) (owner: 10Paladox) [10:26:25] (03CR) 10Marostegui: "@jcrespo feel free to merge this whenever you consider db2072 is ready" [software] - 10https://gerrit.wikimedia.org/r/360318 (https://phabricator.wikimedia.org/T168356) (owner: 10Marostegui) [10:26:52] 10Operations, 10ops-eqiad: IPMI console not working on ms-be1014 / ms-be1015 - https://phabricator.wikimedia.org/T168378#3362681 (10fgiunchedi) [10:31:46] 10Operations, 10MediaWiki-extensions-PageAssessments, 10Wikimedia-General-or-Unknown, 
10Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3362717 (10jcrespo) I can confirm this didn't happen tonight CC @faidon [10:33:51] <_joe_> godog: ok, looking at them [10:38:55] finally success [10:39:02] <_joe_> ahahah [10:39:17] <_joe_> I was asking myself who did that, and I had no doubts, really [10:39:26] I got to know python-irc way way better than I ever wanted to [10:39:45] :D [10:41:05] (03CR) 10Giuseppe Lavagetto: [C: 031] Repurpose mw123[67] as thumbor100[34] [dns] - 10https://gerrit.wikimedia.org/r/360316 (https://phabricator.wikimedia.org/T168297) (owner: 10Filippo Giunchedi) [10:42:27] (03CR) 10Giuseppe Lavagetto: [C: 031] Repurpose mw123[67] as thumbor100[34] [puppet] - 10https://gerrit.wikimedia.org/r/360317 (https://phabricator.wikimedia.org/T168297) (owner: 10Filippo Giunchedi) [10:44:57] (03PS1) 10Alexandros Kosiaris: tcpircbot: Update it to work with 8.5.3 irc library [puppet] - 10https://gerrit.wikimedia.org/r/360325 (https://phabricator.wikimedia.org/T168348) [10:46:42] (03CR) 10Alexandros Kosiaris: [C: 04-2] "See a way of doing this without involving puppet templates in https://gerrit.wikimedia.org/r/#/c/360325/1" [puppet] - 10https://gerrit.wikimedia.org/r/360315 (https://phabricator.wikimedia.org/T168348) (owner: 10Paladox) [10:50:43] (03PS2) 10Filippo Giunchedi: Repurpose mw123[67] as thumbor100[34] [puppet] - 10https://gerrit.wikimedia.org/r/360317 (https://phabricator.wikimedia.org/T168297) [10:53:01] (03CR) 10Alexandros Kosiaris: [C: 032] tcpircbot: Update it to work with 8.5.3 irc library [puppet] - 10https://gerrit.wikimedia.org/r/360325 (https://phabricator.wikimedia.org/T168348) (owner: 10Alexandros Kosiaris) [10:54:02] (03CR) 10Filippo Giunchedi: [C: 032] Repurpose mw123[67] as thumbor100[34] [puppet] - 10https://gerrit.wikimedia.org/r/360317 (https://phabricator.wikimedia.org/T168297) (owner: 10Filippo Giunchedi) [10:54:08] (03PS3) 10Filippo Giunchedi: Repurpose mw123[67] as 
thumbor100[34] [puppet] - 10https://gerrit.wikimedia.org/r/360317 (https://phabricator.wikimedia.org/T168297) [10:56:36] welcome back logmsgbot [10:56:41] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=sca1004.eqiad.wmnet [10:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:17] 10Operations, 10Patch-For-Review: logmsgbot needs restarting - https://phabricator.wikimedia.org/T168348#3362840 (10akosiaris) 05Open>03Resolved a:03akosiaris 1d48aedcf3be272 fixed the issues we had and logmsgbot is running fine again. Resolving this successfully [11:00:30] (03CR) 10Filippo Giunchedi: [C: 032] Repurpose mw123[67] as thumbor100[34] [dns] - 10https://gerrit.wikimedia.org/r/360316 (https://phabricator.wikimedia.org/T168297) (owner: 10Filippo Giunchedi) [11:08:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet [11:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:06] (03PS1) 10Alexandros Kosiaris: Renumber sca1004, mwdebug1002 [dns] - 10https://gerrit.wikimedia.org/r/360326 [11:11:07] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Without looking further and discarding all the other doubts I have about the code (volans' review is a good start, but not the end of it)," (035 comments) [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [11:13:44] !log renumber sca1004, mwdebug1002. 
Downtime should be a few minutes [11:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:18] !log rebooting mediawiki app servers in codfw for kernel update [11:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:06] !log installing libgcrypt security updates [11:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:11] 10Operations, 10Gerrit, 10Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3362936 (10Paladox) @joe thanks for fixing this :) [12:00:52] !log reboot analytics1029 -> analytics1069 for kernel upgrades (Hadoop worker nodes) [12:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:20] (03CR) 10Alexandros Kosiaris: [C: 032] Renumber sca1004, mwdebug1002 [dns] - 10https://gerrit.wikimedia.org/r/360326 (owner: 10Alexandros Kosiaris) [12:09:00] !log starting cluster restart elasticsearch eqiad [12:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:06] (03CR) 10Faidon Liambotis: "No, this is completely untested." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359447 (owner: 10Faidon Liambotis) [12:10:17] gehel: have fun! :P [12:11:32] elukey: I will! 
[12:19:42] (03Abandoned) 10Paladox: tcpircbot: fix bot to support the new way of doing ssl [puppet] - 10https://gerrit.wikimedia.org/r/360315 (https://phabricator.wikimedia.org/T168348) (owner: 10Paladox) [12:24:39] 10Operations: logmsgbot needs restarting - https://phabricator.wikimedia.org/T168348#3363024 (10Paladox) [12:31:57] (03CR) 10Hashar: "I have made git_changed_in_head() to cache the result of the git command in https://gerrit.wikimedia.org/r/#/c/359951/" [puppet] - 10https://gerrit.wikimedia.org/r/357804 (https://phabricator.wikimedia.org/T166888) (owner: 10Hashar) [12:36:14] (03PS2) 10Hashar: tests: disable ruby output buffering [puppet] - 10https://gerrit.wikimedia.org/r/359457 [12:36:16] (03PS2) 10Hashar: Rake: memoize git_changed_in_head() [puppet] - 10https://gerrit.wikimedia.org/r/359951 (https://phabricator.wikimedia.org/T166888) [12:36:18] (03PS4) 10Hashar: Rake: optimize typos task for CI [puppet] - 10https://gerrit.wikimedia.org/r/357804 (https://phabricator.wikimedia.org/T166888) [12:36:21] (03PS1) 10Marostegui: db-eqiad.php: Make db1060 s2 sanitarium2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360329 (https://phabricator.wikimedia.org/T153743) [12:37:37] 10Operations, 10Traffic, 10Wikimedia-Blog, 10HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511#3363110 (10ema) [12:48:46] (03PS1) 10Alexandros Kosiaris: ircecho: Re-tab the entire file [puppet] - 10https://gerrit.wikimedia.org/r/360331 [12:48:48] (03PS1) 10Alexandros Kosiaris: ircecho: Make flake8 compliant [puppet] - 10https://gerrit.wikimedia.org/r/360332 [12:48:50] (03PS1) 10Alexandros Kosiaris: irecho: Fix 510 chars error [puppet] - 10https://gerrit.wikimedia.org/r/360333 [12:55:22] (03PS1) 10Alexandros Kosiaris: Give ircecho a .py extension in the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/360335 [12:57:41] testing [12:59:08] PROBLEM - Apache HTTP on mw1236 is CRITICAL: connect to address 10.64.48.71 and port 80: 
Connection refused [12:59:18] PROBLEM - HHVM processes on mw1236 is CRITICAL: NRPE: Command check_hhvm not defined [12:59:29] PROBLEM - Nginx local proxy to apache on mw1236 is CRITICAL: connect to address 10.64.48.71 and port 443: Connection refused [12:59:29] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-thumbor-wikimedia] [12:59:29] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-thumbor-wikimedia] [12:59:29] (03PS1) 10Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) [12:59:48] PROBLEM - Check systemd state on mw1237 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:59:58] PROBLEM - HHVM processes on mw1237 is CRITICAL: NRPE: Command check_hhvm not defined [12:59:58] PROBLEM - HHVM rendering on mw1237 is CRITICAL: connect to address 10.64.48.72 and port 80: Connection refused [12:59:58] PROBLEM - Nginx local proxy to apache on mw1237 is CRITICAL: connect to address 10.64.48.72 and port 443: Connection refused [12:59:58] PROBLEM - DPKG on mw1236 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:59:59] PROBLEM - Check systemd state on mw1236 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:59:59] PROBLEM - HHVM rendering on mw1236 is CRITICAL: connect to address 10.64.48.71 and port 80: Connection refused [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T1300). [13:00:08] RECOVERY - DPKG on thumbor1004 is OK: All packages OK [13:00:09] PROBLEM - Apache HTTP on mw1237 is CRITICAL: connect to address 10.64.48.72 and port 80: Connection refused [13:00:22] hi [13:00:28] PROBLEM - Disk space on krypton is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied [13:00:58] RECOVERY - DPKG on mw1236 is OK: All packages OK [13:01:05] godog: some weirdness on mw123[67] :) [13:01:12] its repurposed [13:01:14] to thumbor [13:01:22] yeah, just pinged him as fyi [13:01:47] odd, thanks elukey [13:02:27] ah yeah I guess because puppet is or has been stopped on einsteinium [13:02:29] RECOVERY - Disk space on krypton is OK: DISK OK [13:02:41] yes I have disabled it [13:02:46] fighting with ircecho bugs [13:03:27] :( [13:04:28] RECOVERY - puppet last run on thumbor1004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:04:29] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [13:04:29] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:04:38] RECOVERY - thumbor@8808 service on thumbor1004 is OK: OK - thumbor@8808 is active [13:04:38] RECOVERY - thumbor@8825 service on thumbor1004 is OK: OK - thumbor@8825 is active [13:04:38] RECOVERY - thumbor@8815 service on thumbor1004 is OK: OK - thumbor@8815 is active [13:04:38] RECOVERY - thumbor@8832 service on thumbor1004 is OK: OK - thumbor@8832 is active [13:04:38] RECOVERY - thumbor@8801 service on thumbor1004 is OK: OK - thumbor@8801 is active [13:04:39] RECOVERY - thumbor@8818 service on thumbor1004 is OK: OK - thumbor@8818 is active [13:04:48] RECOVERY - thumbor@8822 service on thumbor1004 is OK: OK - thumbor@8822 is active [13:04:48] RECOVERY - thumbor@8805 service on 
thumbor1004 is OK: OK - thumbor@8805 is active [13:04:48] RECOVERY - thumbor@8829 service on thumbor1004 is OK: OK - thumbor@8829 is active [13:04:48] RECOVERY - thumbor@8812 service on thumbor1004 is OK: OK - thumbor@8812 is active [13:04:58] RECOVERY - thumbor@8802 service on thumbor1004 is OK: OK - thumbor@8802 is active [13:04:59] RECOVERY - thumbor@8819 service on thumbor1004 is OK: OK - thumbor@8819 is active [13:04:59] RECOVERY - thumbor@8809 service on thumbor1004 is OK: OK - thumbor@8809 is active [13:04:59] RECOVERY - thumbor@8811 service on thumbor1004 is OK: OK - thumbor@8811 is active [13:04:59] RECOVERY - thumbor@8826 service on thumbor1004 is OK: OK - thumbor@8826 is active [13:04:59] RECOVERY - thumbor@8816 service on thumbor1004 is OK: OK - thumbor@8816 is active [13:04:59] RECOVERY - Check systemd state on mw1236 is OK: OK - unknown: The operational state could not be determined, due to lack of resources or another error cause. [13:05:18] Messages limited to 512 bytes including CR/LF [13:05:27] aude: is there anything else lined up for SWAT (since the wikitech page is empty)? I'm currently rebooting application servers for a kernel update and if there's nothing to be deployed I would continue with that [13:05:28] ok... doesn't this mean my message should be 510 bytes ? [13:05:28] sorry about thumbor, silencing [13:06:18] moritzm: how long does it take? 
[13:06:42] several days [13:06:44] ;-) [13:06:48] PROBLEM - Host thumbor1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:48] RECOVERY - Check systemd state on mw1237 is OK: OK - running: The system is fully operational [13:07:48] RECOVERY - Host thumbor1004 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:08:08] RECOVERY - thumbor@8830 service on thumbor1004 is OK: OK - thumbor@8830 is active [13:08:08] RECOVERY - thumbor@8813 service on thumbor1004 is OK: OK - thumbor@8813 is active [13:08:08] RECOVERY - thumbor@8820 service on thumbor1004 is OK: OK - thumbor@8820 is active [13:08:09] RECOVERY - thumbor@8803 service on thumbor1004 is OK: OK - thumbor@8803 is active [13:08:18] RECOVERY - Check systemd state on thumbor1004 is OK: OK - running: The system is fully operational [13:08:18] RECOVERY - thumbor@8827 service on thumbor1004 is OK: OK - thumbor@8827 is active [13:08:18] RECOVERY - thumbor@8810 service on thumbor1004 is OK: OK - thumbor@8810 is active [13:08:18] RECOVERY - thumbor@8817 service on thumbor1004 is OK: OK - thumbor@8817 is active [13:08:18] RECOVERY - thumbor@8821 service on thumbor1004 is OK: OK - thumbor@8821 is active [13:08:18] RECOVERY - thumbor@8807 service on thumbor1004 is OK: OK - thumbor@8807 is active [13:08:18] RECOVERY - thumbor@8824 service on thumbor1004 is OK: OK - thumbor@8824 is active [13:08:19] RECOVERY - thumbor@8831 service on thumbor1004 is OK: OK - thumbor@8831 is active [13:08:19] RECOVERY - thumbor@8814 service on thumbor1004 is OK: OK - thumbor@8814 is active [13:08:28] RECOVERY - thumbor@8804 service on thumbor1004 is OK: OK - thumbor@8804 is active [13:08:28] RECOVERY - thumbor@8828 service on thumbor1004 is OK: OK - thumbor@8828 is active [13:08:29] i want to take care of https://phabricator.wikimedia.org/T158325 sometime [13:11:53] aaaaa [13:12:21] sorry about what's going to follow [13:12:41] hmmm [13:12:50] so that 512 must mean the entire IRC message [13:13:01] including not just CR/LF but everything 
else in the IRC protocol [13:13:48] PROBLEM - nutcracker process on mwdebug1002 is CRITICAL: NRPE: Command check_nutcracker not defined [13:14:01] !log rebooting restbase staging cluster (cerium/praseodymium/xenon) for kernel update [13:14:08] PROBLEM - zotero on sca1004 is CRITICAL: connect to address 10.64.0.46 and port 1969: Connection refused [13:14:10] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: NRPE: Command check_check_systemd_state not defined [13:14:10] PROBLEM - HHVM processes on mwdebug1002 is CRITICAL: NRPE: Command check_hhvm not defined [13:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:18] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: connect to address 10.64.0.47 and port 443: Connection refused [13:14:18] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: connect to address 10.64.0.47 and port 80: Connection refused [13:14:18] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: connect to address 10.64.0.47 and port 80: Connection refused [13:14:18] PROBLEM - nutcracker port on mwdebug1002 is CRITICAL: NRPE: Command check_nutcracker_port not defined [13:20:00] (03CR) 10Gehel: [C: 031] "If I understand correctly, we also need to add a scap3 config to the plugin repo itself..." [puppet] - 10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [13:22:51] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3363260 (10Johan) It was included in the issue of Tech News that went out yesterday. [13:23:38] moritzm: would it be ok if i deploy my patch sometime before puppet swat? 
[13:23:50] otherwise, maybe i can do it before/after evening swat [13:24:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360342 [13:26:52] !log Deploy alter table labsdb1010 - s5 - T166207 [13:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:02] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [13:27:39] (03PS2) 10Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) [13:27:41] aude: sure, you can start now if you want. I can stop the reboots for now and will simply pick up when you're done [13:27:59] ok [13:28:09] won't take that long [13:28:56] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3363276 (10Gehel) [13:29:14] (03PS2) 10Alexandros Kosiaris: irecho: Avoid MessageTooLong error [puppet] - 10https://gerrit.wikimedia.org/r/360333 [13:29:16] (03PS2) 10Alexandros Kosiaris: Give ircecho a .py extension in the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/360335 [13:29:19] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#3336458 (10Gehel) The current issues need to be fixed before we can activate any alert. 
[13:29:58] aude: let me know when done, so I can deploy a config change for db-eqiad.php :) [13:30:35] ok [13:31:37] aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [13:31:47] aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [13:32:11] lol [13:32:15] (03PS1) 10Aude: Enable Wiktionary site links on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360343 (https://phabricator.wikimedia.org/T158325) [13:32:17] how on earth is that 512 bytes message printed... I don't think I 've ever seen icinga-wm spit out something that large [13:32:25] probably was ignored/muted by the irc servers [13:32:31] aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [13:32:37] ok 450 it is [13:32:43] (03CR) 10Aude: [C: 032] Enable Wiktionary site links on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360343 (https://phabricator.wikimedia.org/T158325) (owner: 10Aude) [13:32:44] looks good enough [13:32:50] (03CR) 10EBernhardson: [C: 031] "the .deb package was for elasticsearch plugins, this is logstash. might also make sense but i haven't thought about it much." 
[puppet] - 10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [13:32:52] one final test [13:32:56] aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [13:33:53] !log pool thumbor100[34] into service - T168297 [13:33:57] going to restart the whole CI servers over the next half hour or so [13:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] T168297: Reimage eqiad imagescalers to be used with thumbor - https://phabricator.wikimedia.org/T168297 [13:34:11] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: Re-tab the entire file [puppet] - 10https://gerrit.wikimedia.org/r/360331 (owner: 10Alexandros Kosiaris) [13:34:18] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: Make flake8 compliant [puppet] - 10https://gerrit.wikimedia.org/r/360332 (owner: 10Alexandros Kosiaris) [13:34:22] (03CR) 10Alexandros Kosiaris: [C: 032] irecho: Avoid MessageTooLong error [puppet] - 10https://gerrit.wikimedia.org/r/360333 (owner: 10Alexandros Kosiaris) [13:34:28] (03Merged) 10jenkins-bot: Enable Wiktionary site links on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360343 (https://phabricator.wikimedia.org/T158325) (owner: 10Aude) [13:34:28] PROBLEM - Check the NTP synchronisation status of timesyncd on mwdebug1002 is CRITICAL: NRPE: Command check_timesynd_ntp_status not defined [13:34:30] (03CR) 10Alexandros Kosiaris: [C: 032] Give ircecho a .py extension in the puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/360335 (owner: 10Alexandros Kosiaris) [13:35:51] testing on mwdebug [13:36:03] 
(03CR) 10jenkins-bot: Enable Wiktionary site links on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360343 (https://phabricator.wikimedia.org/T158325) (owner: 10Aude) [13:36:19] (03PS1) 10Alexandros Kosiaris: irecho: Actually use 450 instead of 510 [puppet] - 10https://gerrit.wikimedia.org/r/360344 [13:36:29] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=mwdebug1002.eqiad.wmnet [13:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:40] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=sca1004.eqiad.wmnet [13:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:07] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] irecho: Actually use 450 instead of 510 [puppet] - 10https://gerrit.wikimedia.org/r/360344 (owner: 10Alexandros Kosiaris) [13:37:46] !log Restarting Jenkins [13:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:59] (03PS1) 10Aude: Fix siteGroups setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360345 [13:38:38] PROBLEM - NTP on sca1004 is CRITICAL: NTP CRITICAL: No response from NTP server [13:38:54] (03CR) 10Aude: [C: 032] Fix siteGroups setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360345 (owner: 10Aude) [13:39:18] PROBLEM - mediawiki-installation DSH group on mwdebug1002 is CRITICAL: Host mwdebug1002 is not in mediawiki-installation dsh group [13:39:22] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install new kafka nodes - https://phabricator.wikimedia.org/T167992#3363320 (10Ottomata) We like jumbo! Let's do it. 
kafka-jumbo100[1-6] [13:39:25] !log rebooting labnodepool1001 for kernel update [13:39:26] 10Operations, 10Discovery, 10Maps, 10monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#3363321 (10Gehel) Since ganglia is being phased out, should we just close this task and move on to prometheus? [13:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3363322 (10Ottomata) [13:39:47] RECOVERY - nutcracker process on mwdebug1002 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [13:39:55] (03Merged) 10jenkins-bot: Fix siteGroups setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360345 (owner: 10Aude) [13:39:56] !log Deploy alter table on db1049 - s5 - T166207 [13:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:06] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [13:40:07] RECOVERY - zotero on sca1004 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.007 second response time [13:40:07] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational [13:40:07] RECOVERY - HHVM processes on mwdebug1002 is OK: PROCS OK: 6 processes with command name hhvm [13:40:08] (03CR) 10jenkins-bot: Fix siteGroups setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360345 (owner: 10Aude) [13:40:17] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.111 second response time [13:40:17] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 0.114 second response time [13:40:17] RECOVERY - HHVM rendering on mwdebug1002 is OK: 
HTTP OK: HTTP/1.1 200 OK - 75679 bytes in 0.187 second response time [13:40:17] RECOVERY - nutcracker port on mwdebug1002 is OK: TCP OK - 0.003 second response time on 127.0.0.1 port 11212 [13:40:41] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (10Ottomata) [13:40:52] and with that, I think I 've fixed the damn ircecho thing [13:42:17] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [13:42:17] RECOVERY - swift-object-auditor on ms-be1016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [13:42:17] RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 45.23, 39.87, 37.18 [13:42:18] RECOVERY - Check size of conntrack table on ms-be1016 is OK: OK: nf_conntrack is 6 % full [13:42:18] RECOVERY - swift-account-reaper on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [13:42:18] RECOVERY - salt-minion processes on ms-be1016 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:42:27] RECOVERY - swift-object-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [13:42:28] <_joe_> !log manually started nrpe on ms-be1016 [13:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:37] RECOVERY - swift-account-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [13:42:37] RECOVERY - Disk space on ms-be1016 is OK: DISK OK [13:42:38] RECOVERY - DPKG on ms-be1016 is OK: All packages OK [13:42:47] RECOVERY - swift-container-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [13:42:47] RECOVERY 
- swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [13:42:47] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1016 is OK: OK ferm input default policy is set [13:42:57] RECOVERY - swift-account-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:42:57] RECOVERY - swift-account-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:42:57] RECOVERY - swift-container-auditor on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:43:07] RECOVERY - swift-container-server on ms-be1016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [13:43:07] RECOVERY - dhclient process on ms-be1016 is OK: PROCS OK: 0 processes with command name dhclient [13:43:07] RECOVERY - swift-object-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [13:43:07] RECOVERY - configured eth on ms-be1016 is OK: OK - interfaces up [13:44:14] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Enable Wiktionary site links on test.wikidata (duration: 00m 43s) [13:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:25] think i'm done for now [13:44:46] i have to do wikidata later once i'm sure test.wikidata is good [13:44:51] aude: cool thanks! 
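Note on the ircecho length discussion above: the 512-byte limit covers the whole IRC line (prefix, command, target, and the trailing CR/LF), not just the text, which is why a payload cap well under 512 (450 in the log) was chosen. The following is only a minimal sketch of that arithmetic, not the actual ircecho code; the function names and the sample prefix/target values are hypothetical.

```python
IRC_MAX_LINE = 512  # per-message limit from the IRC protocol, CR LF included


def payload_budget(prefix: str, target: str) -> int:
    """Bytes left for the text in ':prefix PRIVMSG target :text\\r\\n'.

    `prefix` is the nick!user@host string the server prepends when it
    relays the message; the sender can only estimate it (assumption).
    """
    overhead = len(f":{prefix} PRIVMSG {target} :".encode("utf-8")) + 2  # +2 for CR LF
    return IRC_MAX_LINE - overhead


def truncate_utf8(text: str, max_bytes: int = 450) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the end
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    budget = payload_budget("somebot!~somebot@example.org", "#some-channel")
    print("bytes available for text:", budget)
    print(len(truncate_utf8("a" * 600).encode("utf-8")))
```

A fixed cap such as 450 trades a few unused bytes for not having to know the exact prefix the server will add; computing the budget per target is the tighter alternative.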
[13:45:00] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360342 (owner: 10Marostegui) [13:46:51] aude: ok, continuing with mw* reboots [13:46:54] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360342 (owner: 10Marostegui) [13:46:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360342 (owner: 10Marostegui) [13:47:30] (03CR) 10Aude: [C: 031] "needs rebase, since adding configs for test.wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [13:47:35] (03CR) 10Aude: [C: 04-1] Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [13:47:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1071 - T166207 (duration: 00m 41s) [13:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:08] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [13:48:39] (03PS1) 10Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360347 (https://phabricator.wikimedia.org/T166207) [13:50:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360347 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [13:52:17] PROBLEM - Host elastic1019 is DOWN: PING CRITICAL - Packet loss = 100% [13:53:27] RECOVERY - Host elastic1019 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:53:42] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360347 
(https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [13:54:24] (03PS2) 10Mobrovac: PDF Render: Check hourly if the service is running via cron [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [13:54:35] (03PS1) 10Filippo Giunchedi: thumbor: use jessie-backports as target release for python-thumbor-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/360350 (https://phabricator.wikimedia.org/T121388) [13:55:02] !log Deploy alter table db1087 - s5 - T166207 [13:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:10] T166207: Convert unique keys into primary keys for some wiki tables on s5 - https://phabricator.wikimedia.org/T166207 [13:55:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T166207 (duration: 01m 41s) [13:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360347 (https://phabricator.wikimedia.org/T166207) (owner: 10Marostegui) [13:57:13] <_joe_> mobrovac: seriously? [13:57:14] <_joe_> :P [13:57:42] _joe_: hard times and desperate measures :) [13:57:47] RECOVERY - IPMI Temperature on ms-be1016 is OK: Sensor Type(s) Temperature Status: OK [13:57:59] (03CR) 10Gehel: "@EBernhardson: thanks! I'm mixing up things. The related change to add scap3 cofniguration to the plugin repo is https://gerrit.wikimedia." [puppet] - 10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) (owner: 10Thcipriani) [13:58:06] mobrovac: ciao Marko! I'd need to reboot kafka[12]00[123] [13:58:28] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be1016 is OK: OK: synced at Tue 2017-06-20 13:58:23 UTC. [13:58:40] ciao elukey, sure, got a time-line in mind? 
[13:59:18] mobrovac: I could do kafka2001 and kafka1001 today, let the new kernel boil for a day and then finish tomorrow [14:00:02] oh it's about the new kernel? hm ok [14:00:18] mobrovac: combination of kernel and glibc changes in fact [14:00:22] elukey: let's do kafka2001 first, wait a bit and then move onto kafka1001 [14:00:32] moritzm: yup yup, seen yday's chatter :) [14:00:41] mobrovac: sure [14:00:48] !log Stopping Nodepool service to prevent new builds [14:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:30] (03CR) 10Krinkle: "(More for my own learning than an actual suggestion) Would it not make sense to do this as part of the 'service' / upstart script? Seems l" [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [14:02:57] !log Rebooting contint1001 [14:02:58] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:17] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:03:54] (03PS3) 10Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) [14:04:27] RECOVERY - Check the NTP synchronisation status of timesyncd on mwdebug1002 is OK: OK: synced at Tue 2017-06-20 14:04:19 UTC. 
[14:04:58] 10Operations, 10Labs, 10Labs-Infrastructure, 10cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3363428 (10Andrew) [14:05:35] !log reboot kafka2001 for kernel upgrade [14:05:41] !log Starting Jenkins on contint1001 [14:05:43] <_joe_> mobrovac: tbh, re T159922 - I'd rather disable electron on wikis for now [14:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:47] RECOVERY - HP RAID on ms-be1019 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [14:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:52] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922 [14:06:17] _joe_: but it's already used on dewiki [14:06:30] <_joe_> mobrovac: we can revert that I guess [14:06:37] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:06:55] <_joe_> the service doesn't work well, no one is investing time in it at the moment (rightfully so), so let's disable it for now? 
[14:07:34] (03CR) 10Mobrovac: "@Krinkle, I would agree, but the problem here is that sometimes the restart issued by SystemD leaves the service in a weird state wherein " [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [14:08:05] (03PS1) 10Ayounsi: Depool codfw for asw-d-codfw upgrade [dns] - 10https://gerrit.wikimedia.org/r/360352 (https://phabricator.wikimedia.org/T167274) [14:08:37] _joe_: afaik that would leave the pdf extension dewiki is using without an alternative [14:08:37] RECOVERY - NTP on sca1004 is OK: NTP OK: Offset -0.007870227098 secs [14:08:50] _joe_: that said, i'm not quite sure what that extension does :P [14:08:55] <_joe_> sigh [14:08:58] 10Operations, 10monitoring: Aggregate prometheus functions yielding different results in grafana vs. prometheus console - https://phabricator.wikimedia.org/T168403#3363449 (10ema) [14:09:06] 10Operations, 10monitoring: Aggregate prometheus functions yielding different results in grafana vs. prometheus console - https://phabricator.wikimedia.org/T168403#3363435 (10ema) p:05Triage>03Normal [14:09:25] <_joe_> so we have a second PDF rendere in production with issues [14:09:31] <_joe_> and with no one fixing it [14:09:37] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:39] <_joe_> and that's not replaceable [14:11:06] _joe_: i will look into the extension and see what we can do, but for now I would suggest to move with https://gerrit.wikimedia.org/r/#/c/359967/ ; it's ugly, but it will improve the situation a bit [14:11:43] <_joe_> I'm not convinced spit and duct tape always make things better ;) [14:11:50] 10Operations, 10monitoring, 10Prometheus-metrics-monitoring: Aggregate prometheus functions yielding different results in grafana vs. 
prometheus console - https://phabricator.wikimedia.org/T168403#3363453 (10ema) [14:14:34] PROBLEM - cassandra-a SSL 10.64.0.202:7001 on xenon is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:14:54] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: connect to address 10.64.0.202 and port 9042: Connection refused [14:14:54] PROBLEM - Check systemd state on xenon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:15:15] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:16:44] !log Upgraded Jenkins plugins [14:16:44] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:00] !log CI is fully backup (following reboot of contint1001 / labnodepool1001 ) [14:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:31] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Up to now we have seen multiple times that pdfrender does not get fixed just by restarting it. More than one times the restart has failed " [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [14:19:44] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:22:44] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:23:12] 10Operations, 10Project-Admins, 10DevRel-February-2016, 10DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#3363483 (10Aklapper) (For the records, I changed H131 from "Take these actions every time" to "Take these actions only the first time"... 
[14:23:14] (03PS1) 10Ayounsi: Route around codfw for asw-d-codfw switch upgrade [puppet] - 10https://gerrit.wikimedia.org/r/360357 (https://phabricator.wikimedia.org/T167274) [14:23:18] (03CR) 10Ayounsi: [C: 032] Depool codfw for asw-d-codfw upgrade [dns] - 10https://gerrit.wikimedia.org/r/360352 (https://phabricator.wikimedia.org/T167274) (owner: 10Ayounsi) [14:23:59] 10Operations, 10monitoring, 10Interactive-Sprint, 10Maps (Kartotherian), 10Technical-Debt: Geoshape and geoline subservices need monitoring - https://phabricator.wikimedia.org/T166776#3363485 (10Gehel) Pull request created: https://github.com/kartotherian/kartotherian/pull/31 [14:25:35] (03CR) 10Ema: [C: 031] Depool codfw for asw-d-codfw upgrade [dns] - 10https://gerrit.wikimedia.org/r/360352 (https://phabricator.wikimedia.org/T167274) (owner: 10Ayounsi) [14:25:44] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:26:16] (03CR) 10Ema: [C: 031] Route around codfw for asw-d-codfw switch upgrade [puppet] - 10https://gerrit.wikimedia.org/r/360357 (https://phabricator.wikimedia.org/T167274) (owner: 10Ayounsi) [14:26:38] I'll take a look at ms-be1016 too [14:26:44] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:26:49] (03CR) 10Ayounsi: [C: 032] Route around codfw for asw-d-codfw switch upgrade [puppet] - 10https://gerrit.wikimedia.org/r/360357 (https://phabricator.wikimedia.org/T167274) (owner: 10Ayounsi) [14:27:28] !log rebooting scb1001 for kernel update [14:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:44] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:32:05] !log depooled codfw - T167274 [14:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:14] T167274: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274 [14:32:44] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:33:30] elukey: kafka2001 rebooted already? [14:34:03] mobrovac: nope, for some reason I didn't find a way to depool it (I can see traffic with httpry for eventbus.discovery.wmnet:8085 on the host) [14:35:15] elukey: so eb.discovery does not point to the lvs or sth? [14:35:24] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2053677 [14:35:44] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:35:59] mobrovac: theoretically it should, I quickly checked pybal logs on lvs2003 (that should be the lvs active) but didn't find anything [14:36:06] in the meantime I am doing other reboots :P [14:36:19] will ping you when I am done [14:36:26] kk thnx elukey [14:36:44] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:38:09] this is systemd trying and failing to mount a corrupted fs [14:38:41] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3363547 (10Gestrid) [14:39:44] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:40:04] _joe_ oh sorry, I am not on IRC during most of the day (BST) during the week. Didn't see your message. Yeah, we had problems with systemd before, so we raised the limit, but it seems that it is having problems with that limit. So I guess we can make a big jump in the limit to try to prevent it from doing it again.
[14:41:18] (03CR) 10GWicke: [C: 031] "The check & conditional restart is a lot better than the unconditional restart we had earlier." [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [14:41:19] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3363554 (10Gestrid) Per @akosiaris ' last comment, I've taken the liberty to update the task description to reflect the now non-tentative date. [14:42:14] 10Operations, 10Electron-PDFs, 10TCB-Team, 10Patch-For-Review, 10User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2776664 (10mobrovac) We are having problems with the electron service in production (cf. {T159922}) and we need to do something about... [14:42:45] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:43:54] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 2012925 [14:44:12] 10Operations, 10ops-eqiad: IPMI console not working on ms-be1014 / ms-be1015 - https://phabricator.wikimedia.org/T168378#3363577 (10fgiunchedi) 05Open>03Resolved With help from @Cmjohnson we've restored the console on these two boxes by draining flea power [14:44:14] 10Operations, 10Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3363579 (10fgiunchedi) [14:45:44] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:46:11] 10Operations, 10ops-eqiad: IPMI console not working on ms-be1014 / ms-be1015 - https://phabricator.wikimedia.org/T168378#3362681 (10Marostegui) @Cmjohnson could that be the same fix for: T160392 ? 
[14:46:44] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [14:47:04] !log rolling restart of druid100[123] for kernel upgrades [14:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:45] (03CR) 10Marostegui: [C: 032] s1.hosts: Add db2072 to s1 [software] - 10https://gerrit.wikimedia.org/r/360318 (https://phabricator.wikimedia.org/T168356) (owner: 10Marostegui) [14:52:54] (03Merged) 10jenkins-bot: s1.hosts: Add db2072 to s1 [software] - 10https://gerrit.wikimedia.org/r/360318 (https://phabricator.wikimedia.org/T168356) (owner: 10Marostegui) [14:54:36] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3363615 (10RobH) [14:57:57] 10Operations, 10ArchCom-RfC, 10Commons, 10MediaWiki-File-management, and 12 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#3363634 (10GWicke) [15:08:56] !log starting asw-d-codfw switch upgrade - T167274 [15:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:05] T167274: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274 [15:12:14] PROBLEM - Host elastic1020 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:04] RECOVERY - Host elastic1020 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:18:34] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[initramfs-tools] [15:23:50] (03CR) 10Gilles: [C: 031] thumbor: use jessie-backports as target release for python-thumbor-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/360350 (https://phabricator.wikimedia.org/T121388) (owner: 10Filippo Giunchedi) [15:25:04] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3363765 (10Andrew) [15:30:59] (03PS1) 10Cmjohnson: Per akosaris request, moved ganeti1007-8 to row A, this is the dns change to reflect that change. T166076 [dns] - 10https://gerrit.wikimedia.org/r/360366 [15:31:02] (03PS1) 10Ottomata: Change hive-site.xml group ownership to 'hive' [puppet/cdh] - 10https://gerrit.wikimedia.org/r/360367 [15:31:39] ottomata: --^ s/hive/hdfs ? [15:32:47] (03CR) 10Cmjohnson: [C: 032] Per akosaris request, moved ganeti1007-8 to row A, this is the dns change to reflect that change. T166076 [dns] - 10https://gerrit.wikimedia.org/r/360366 (owner: 10Cmjohnson) [15:36:10] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3363806 (10Cmjohnson) @akosiaris the 2 servers have been moved and switch and dns updated..they are ready for you whenever you're ready. 
[15:37:04] PROBLEM - Host cp2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:05] PROBLEM - Host cp2019 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:24] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:24] PROBLEM - Host cp2021 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:45] that's codfw switch upgrade in process [15:37:54] PROBLEM - Host ms-be2037 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] PROBLEM - Host ms-be2024 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] PROBLEM - Host ms-be2038 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] PROBLEM - Host ms-fe2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] PROBLEM - Host ms-be2022 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] PROBLEM - Host ms-be2023 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:19] if any host pages, it needs its parent/child relationship fixed in Icinga [15:38:34] PROBLEM - configured eth on lvs2005 is CRITICAL: eth3 reporting no carrier. [15:38:34] PROBLEM - configured eth on lvs2004 is CRITICAL: eth3 reporting no carrier. [15:38:44] PROBLEM - configured eth on lvs2006 is CRITICAL: eth3 reporting no carrier. [15:39:04] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 114, down: 2, dormant: 0, excluded: 0, unused: 0BRet-0/2/1: down - Core: asw-d-codfw:et-2/0/51 {#10705} [40Gbps DF]BRae4: down - Core: asw-d-codfw:ae1BR [15:40:26] ...... no bueno? [15:40:35] oh nm [15:40:37] planned upgrade. 
[15:40:47] robh: yup, T167274 [15:40:49] T167274: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274 [15:41:35] RECOVERY - configured eth on lvs2004 is OK: OK - interfaces up [15:41:35] RECOVERY - Host cp2022 is UP: PING WARNING - Packet loss = 44%, RTA = 36.03 ms [15:41:44] RECOVERY - Host cp2019 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms [15:41:45] RECOVERY - Host cp2020 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [15:41:45] RECOVERY - Host ms-be2022 is UP: PING OK - Packet loss = 0%, RTA = 36.97 ms [15:41:45] RECOVERY - Host ms-be2024 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms [15:41:45] RECOVERY - Host ms-be2023 is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [15:41:45] RECOVERY - Host ms-be2037 is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms [15:41:45] RECOVERY - Host cp2021 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [15:41:46] RECOVERY - Host ms-fe2008 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [15:41:46] RECOVERY - configured eth on lvs2006 is OK: OK - interfaces up [15:41:54] RECOVERY - Host ms-be2038 is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [15:42:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [15:42:26] RECOVERY - configured eth on lvs2005 is OK: OK - interfaces up [15:43:04] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 53 not-conn: kafka1012_v4 [15:44:04] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [15:44:04] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:05] PROBLEM - puppet last run on ms-be2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:34] PROBLEM - Host mc2032 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:44] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:44:44] PROBLEM - puppet last run on ms-be2038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:45] PROBLEM - Host db2078 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:45] PROBLEM - Host elastic2021 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:45] PROBLEM - Host elastic2019 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:45] PROBLEM - Host elastic2034 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:45] PROBLEM - Host es2013 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:46] PROBLEM - Host es2016 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:54] PROBLEM - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:54] PROBLEM - Host restbase2012 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:54] PROBLEM - Host restbase2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:54] PROBLEM - Host ores2007 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:54] PROBLEM - Host db2088 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:44] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:46:57] (03PS2) 10Ottomata: Change hive-site.xml group ownership to 'hdfs' [puppet/cdh] - 10https://gerrit.wikimedia.org/r/360367 [15:47:38] (03CR) 10Elukey: [C: 031] Change hive-site.xml group ownership to 'hdfs' [puppet/cdh] - 10https://gerrit.wikimedia.org/r/360367 (owner: 10Ottomata) [15:49:41] (03CR) 10Ottomata: [V: 032 C: 032] Change hive-site.xml group ownership to 'hdfs' [puppet/cdh] - 10https://gerrit.wikimedia.org/r/360367 (owner: 10Ottomata) [15:50:32] (03PS1) 10Jforrester: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360370 (https://phabricator.wikimedia.org/T162849) [15:50:34] (03PS1) 10Jforrester: Enable OOjs UI EditPage buttons on all Wikipedias [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) [15:50:40] PROBLEM - MariaDB Slave IO: es2 on es1011 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2016.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2016.codfw.wmnet (110 Connection timed out) [15:50:54] PROBLEM - IPsec on mc1032 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2032_v4 [15:50:57] (03CR) 10Jforrester: [C: 04-2] "Not for a while." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360371 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [15:51:05] PROBLEM - MariaDB Slave IO: es2 on es2015 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2016.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es2016.codfw.wmnet (110 Connection timed out) [15:51:05] PROBLEM - MariaDB Slave IO: es2 on es2014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2016.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2016.codfw.wmnet (110 Connection timed out) [15:51:13] (03PS1) 10Ottomata: Update cdh module with hive-site.xml group ownership change [puppet] - 10https://gerrit.wikimedia.org/r/360373 [15:51:15] (03CR) 10Jforrester: [C: 04-1] "Planned for 5 July." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/360370 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [15:51:24] RECOVERY - Host restbase2005 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [15:51:24] RECOVERY - Host es2013 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [15:51:24] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:51:24] RECOVERY - Host es2016 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [15:51:24] RECOVERY - Host elastic2034 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:51:24] (03CR) 10Ottomata: [V: 032 C: 032] Update cdh module with hive-site.xml group ownership change [puppet] - 10https://gerrit.wikimedia.org/r/360373 (owner: 10Ottomata) [15:51:25] RECOVERY - Host restbase2009 is UP: PING OK - Packet loss = 0%, RTA = 37.24 ms [15:51:25] RECOVERY - Host elastic2021 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [15:51:26] RECOVERY - Host mc2032 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [15:51:27] RECOVERY - Host restbase2012 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [15:51:27] RECOVERY - Host db2078 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [15:51:27] RECOVERY - Host db2088 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [15:51:28] RECOVERY - Host elastic2019 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [15:51:39] RECOVERY - MariaDB Slave IO: es2 on es1011 is OK: OK slave_io_state Slave_IO_Running: Yes [15:51:39] RECOVERY - Host ores2007 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [15:51:54] RECOVERY - IPsec on mc1032 is OK: Strongswan OK - 1 ESP OK [15:52:04] RECOVERY - MariaDB Slave IO: es2 on es2015 is OK: OK slave_io_state Slave_IO_Running: Yes [15:52:04] RECOVERY - MariaDB Slave IO: es2 on es2014 is OK: OK slave_io_state Slave_IO_Running: Yes [15:54:04] PROBLEM - puppet last run on db2078 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:54:05] PROBLEM - puppet last run on db2088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:54:05] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:14] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2084331 [15:56:17] hasharAway: fyi andrewbogott is going to help babysit the nodepool rate bump [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T1600). [16:00:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:01:37] ^ andrewbogott nova-fullstack something you did for recovery? [16:02:04] I don't think I did anything [16:03:14] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [16:03:21] heh, there we go :) [16:04:12] andrewbogott: seems to be flapping then ah [16:04:15] andrewbogott: because 2017-06-20 16:00:11,605 ERROR max server(s) with prepend fullstackd [16:04:20] it leaked to the limit [16:04:57] yeah, I haven't gone through and audited the failures yet [16:05:33] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3363907 (10RobH) a:05Cmjohnson>03RobH [16:05:51] !log reboot kafka1013 for kernel upgrade [16:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:12] andrewbogott: do you have time?
otherwise we should stop the service manually or just clear out some of the older ones [16:06:19] it's logging a lot of failures etc trying to restart [16:06:26] I think I'm fine if you want to just clear it out for now. [16:06:34] PROBLEM - Host mc2033 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:38] Either there will be more failures for investigation or there won't, and either way we win :) [16:06:44] PROBLEM - Host ores2008 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:32] !log openstack server delete admin-monitoring openstack project instances (we have leaked 7) [16:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:44] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:10:55] RECOVERY - puppet last run on ms-be2038 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:11:15] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:12:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:12:24] PROBLEM - IPsec on mc1033 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2033_v4 [16:13:04] RECOVERY - Host ores2008 is UP: PING OK - Packet loss = 0%, RTA = 42.60 ms [16:13:04] RECOVERY - Host mc2033 is UP: PING OK - Packet loss = 16%, RTA = 43.19 ms [16:13:24] RECOVERY - IPsec on mc1033 is OK: Strongswan OK - 1 ESP OK [16:15:14] RECOVERY - puppet last run on db2078 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:16:34] PROBLEM - Host wdqs2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:55] PROBLEM - Host conf2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:14] RECOVERY - puppet last run on db2088 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:17:15] PROBLEM - Host ores2009 is DOWN: PING 
CRITICAL - Packet loss = 100% [16:17:24] PROBLEM - Host puppetmaster2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:24] PROBLEM - Host scb2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:25] PROBLEM - Host wasat is DOWN: PING CRITICAL - Packet loss = 100% [16:17:25] PROBLEM - Host elastic2036 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:25] PROBLEM - Host elastic2035 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:25] PROBLEM - Host elastic2022 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:25] PROBLEM - Host elastic2024 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:26] PROBLEM - Host wezen is DOWN: PING CRITICAL - Packet loss = 100% [16:17:34] PROBLEM - Host rdb2006 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:35] PROBLEM - Host maps2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:35] PROBLEM - Host gerrit2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:35] PROBLEM - Host restbase2006 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:35] PROBLEM - Host mc2035 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:35] PROBLEM - Host pc2006 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:35] PROBLEM - Host mc2034 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:36] PROBLEM - Host elastic2023 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:38] PROBLEM - Host kubernetes2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:58] (03PS1) 10Cmjohnson: Adding mgmt dns for conf1004-6 T166081 [dns] - 10https://gerrit.wikimedia.org/r/360377 [16:19:24] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:20:40] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1141 threshold =0.1 breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1054, number_of_pending_tasks: 11, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3072, 
task_max_waiting_in_queue_millis: 198251, cluster_name: production-search-codfw, relocating_shards: 0, acti [16:20:40] s_number: 87.6233864844, active_shards: 8078, initializing_shards: 87, number_of_data_nodes: 31, delayed_unassigned_shards: 0 [16:22:05] ACKNOWLEDGEMENT - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1096 threshold =0.1 breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1010, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3072, task_max_waiting_in_queue_millis: 947, cluster_name: production-search-codfw, relocating_shards: 0, [16:22:05] nt_as_number: 88.1115088404, active_shards: 8123, initializing_shards: 86, number_of_data_nodes: 31, delayed_unassigned_shards: 0 Gehel row D switch upgrade in progress - lots of shards relocating, but cluster is still yellow - https://phabricator.wikimedia.org/T167274 [16:22:54] PROBLEM - IPsec on mc1035 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2035_v4 [16:22:54] PROBLEM - IPsec on mc1034 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2034_v4 [16:23:54] RECOVERY - IPsec on mc1035 is OK: Strongswan OK - 1 ESP OK [16:23:54] RECOVERY - IPsec on mc1034 is OK: Strongswan OK - 1 ESP OK [16:23:54] RECOVERY - Host restbase2006 is UP: PING OK - Packet loss = 0%, RTA = 42.14 ms [16:23:55] RECOVERY - Host elastic2024 is UP: PING OK - Packet loss = 0%, RTA = 42.83 ms [16:23:55] RECOVERY - Host elastic2022 is UP: PING OK - Packet loss = 0%, RTA = 43.47 ms [16:23:55] RECOVERY - Host pc2006 is UP: PING OK - Packet loss = 0%, RTA = 43.12 ms [16:23:55] RECOVERY - Host scb2002 is UP: PING OK - Packet loss = 0%, RTA = 43.28 ms [16:23:56] RECOVERY - Host rdb2006 is UP: PING OK - Packet loss = 0%, RTA = 43.19 ms [16:23:56] RECOVERY - Host maps2004 is UP: PING OK - Packet loss = 0%, RTA = 43.55 ms [16:23:57] RECOVERY - Host gerrit2001 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms 
[16:23:57] RECOVERY - Host puppetmaster2002 is UP: PING OK - Packet loss = 0%, RTA = 42.70 ms [16:23:58] RECOVERY - Host wasat is UP: PING OK - Packet loss = 0%, RTA = 42.43 ms [16:23:58] RECOVERY - Host wdqs2002 is UP: PING OK - Packet loss = 0%, RTA = 42.62 ms [16:23:59] RECOVERY - Host elastic2036 is UP: PING OK - Packet loss = 0%, RTA = 42.82 ms [16:24:00] RECOVERY - Host elastic2035 is UP: PING OK - Packet loss = 0%, RTA = 42.25 ms [16:24:00] RECOVERY - Host wezen is UP: PING OK - Packet loss = 0%, RTA = 42.02 ms [16:24:14] RECOVERY - Host ores2009 is UP: PING OK - Packet loss = 0%, RTA = 45.04 ms [16:24:49] RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 36, unassigned_shards: 847, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3072, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_number: 9 [16:24:49] e_shards: 8323, initializing_shards: 49, number_of_data_nodes: 36, delayed_unassigned_shards: 0 [16:26:04] RECOVERY - Check systemd state on xenon is OK: OK - running: The system is fully operational [16:26:14] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:15] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:15] PROBLEM - puppet last run on ores2009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:26:34] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [16:26:45] RECOVERY - cassandra-a SSL 10.64.0.202:7001 on xenon is OK: SSL OK - Certificate xenon-a valid until 2017-09-08 16:32:33 +0000 (expires in 80 days) [16:26:55] mobrovac: ^^ [16:27:04] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.000 second response time on 10.64.0.202 port 9042 [16:27:48] urandom: \o/ [16:28:24] PROBLEM - Host db2063 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:24] PROBLEM - Host db2054 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:25] PROBLEM - Host db2060 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:25] PROBLEM - Host es2019 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:25] PROBLEM - Host db2067 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:25] PROBLEM - Host db2053 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:25] PROBLEM - Host db2056 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:26] PROBLEM - Host db2057 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:26] PROBLEM - Host db2058 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:27] PROBLEM - Host db2064 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:27] PROBLEM - Host db2068 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:28] PROBLEM - Host db2061 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:28] PROBLEM - Host db2069 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:29] PROBLEM - Host db2055 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:45] PROBLEM - Host db2084 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:24] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2181.codfw.wmnet because of too many down!: api-https_443 - Could not depool server mw2222.codfw.wmnet because of too many down!: appservers-https_443 - Could not depool server mw2181.codfw.wmnet because of too many down!: api_80 - Could 
not depool server mw2222.codfw.wmnet because of too many down!
[16:29:24] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2233.codfw.wmnet because of too many down!: api-https_443 - Could not depool server mw2133.codfw.wmnet because of too many down!: appservers-https_443 - Could not depool server mw2110.codfw.wmnet because of too many down!: api_80 - Could not depool server mw2132.codfw.wmnet because of too many down!
[16:30:13] Operations, ops-eqiad, Labs, Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364014 (RobH) @andrew: These hosts were reviewed and approved for order with 10 * 1.6TB Intel S3510 SSDs. With hardware raid, you raid the ENTIRE disk,...
[16:32:35] (CR) Cmjohnson: [C: 2] Adding mgmt dns for conf1004-6 T166081 [dns] - https://gerrit.wikimedia.org/r/360377 (owner: Cmjohnson)
[16:34:54] RECOVERY - Host db2054 is UP: PING OK - Packet loss = 0%, RTA = 46.10 ms
[16:34:54] RECOVERY - Host db2061 is UP: PING OK - Packet loss = 0%, RTA = 47.34 ms
[16:34:54] RECOVERY - Host db2053 is UP: PING OK - Packet loss = 0%, RTA = 47.03 ms
[16:34:54] RECOVERY - Host db2064 is UP: PING OK - Packet loss = 0%, RTA = 46.54 ms
[16:34:55] RECOVERY - Host db2055 is UP: PING OK - Packet loss = 0%, RTA = 46.03 ms
[16:34:55] RECOVERY - Host db2060 is UP: PING OK - Packet loss = 0%, RTA = 46.33 ms
[16:34:55] RECOVERY - Host db2063 is UP: PING OK - Packet loss = 0%, RTA = 46.70 ms
[16:34:56] RECOVERY - Host es2019 is UP: PING OK - Packet loss = 0%, RTA = 45.87 ms
[16:34:56] RECOVERY - Host db2056 is UP: PING OK - Packet loss = 0%, RTA = 46.05 ms
[16:34:57] RECOVERY - Host db2065 is UP: PING OK - Packet loss = 0%, RTA = 46.46 ms
[16:34:57] RECOVERY - Host db2069 is UP: PING OK - Packet loss = 0%, RTA = 46.97 ms
[16:34:58] RECOVERY - Host db2084 is UP: PING OK - Packet loss = 0%, RTA = 46.85 ms
[16:34:58] RECOVERY - Host db2058 is UP: PING OK - Packet loss = 0%, RTA = 47.21 ms
[16:34:59] RECOVERY - Host db2057 is UP: PING OK - Packet loss = 16%, RTA = 46.95 ms
[16:35:04] RECOVERY - Host db2067 is UP: PING OK - Packet loss = 0%, RTA = 45.88 ms
[16:35:05] RECOVERY - Host db2074 is UP: PING OK - Packet loss = 0%, RTA = 46.29 ms
[16:35:05] RECOVERY - Host db2068 is UP: PING OK - Packet loss = 0%, RTA = 46.24 ms
[16:35:24] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[16:35:24] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[16:35:44] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:36:09] Operations, ops-eqiad, Labs, Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364029 (Andrew) > Would that be acceptable? Yep, sounds great. Thank you.
[16:37:04] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:37:05] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:37:45] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:39:04] Operations, ops-eqiad, Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3364038 (akosiaris)
[16:39:43] Operations, ops-eqiad, Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3284602 (akosiaris) Open>Resolved Very nice! Thank you @Cmjohnson. Boxes are being installed and added to the ganeti cluster as we speak
[16:41:24] PROBLEM - Host mc2036 is DOWN: PING CRITICAL - Packet loss = 100%
[16:41:44] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Zotero alive) timed out before a response was received
[16:41:44] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Zotero alive) timed out before a response was received
[16:41:53] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[16:42:31] i guess that's connected to the network maintenance ^ ?
[16:43:34] mobrovac: probably, yes
[16:43:43] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[16:43:43] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[16:43:43] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[16:43:50] i like your political answer ema :)
[16:44:39] mobrovac: :)
[16:45:53] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],Service[sysfsutils],File_line[login.defs-SYS_UID_MAX]
[16:45:53] XioNoX: how many switches to go?
[16:46:53] PROBLEM - IPsec on mc1036 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2036_v4
[16:46:54] ema: almost done with 8, then there is 7 to finish with
[16:47:03] nice
[16:47:19] and then enable igmp snooping
[16:47:53] RECOVERY - Host mc2036 is UP: PING OK - Packet loss = 0%, RTA = 47.19 ms
[16:47:54] RECOVERY - IPsec on mc1036 is OK: Strongswan OK - 1 ESP OK
[16:49:33] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[16:50:33] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[16:52:33] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[16:53:06] !log updating the d-i image for stretch in puppet volatile
[16:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:53] PROBLEM - Host cp2024 is DOWN: PING CRITICAL - Packet loss = 100%
[16:53:53] PROBLEM - Host cp2023 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:03] PROBLEM - Host cp2025 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:23] PROBLEM - Host cp2026 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:53] PROBLEM - Host ms-be2027 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:53] PROBLEM - Host ms-be2025 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:54] PROBLEM - Host ms-be2026 is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:54] PROBLEM - Host ms-be2039 is DOWN: PING CRITICAL - Packet loss = 100%
[16:55:03] PROBLEM - configured eth on lvs2002 is CRITICAL: eth3 reporting no carrier.
[16:55:26] !log Ran maintain-meta_p --all-databases on labsdb1001
[16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:54] PROBLEM - configured eth on lvs2003 is CRITICAL: eth3 reporting no carrier.
[16:55:54] PROBLEM - configured eth on lvs2001 is CRITICAL: eth3 reporting no carrier.
[16:56:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 114, down: 2, dormant: 0, excluded: 0, unused: 0; ae4: down - Core: asw-d-codfw:ae2; et-0/2/1: down - Core: asw-d-codfw:et-7/0/52 {#10709} [40Gbps DF]
[16:57:43] RECOVERY - Host cp2023 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[16:57:43] RECOVERY - Host cp2025 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[16:57:43] RECOVERY - Host ms-be2027 is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms
[16:57:43] RECOVERY - Host cp2026 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms
[16:57:43] RECOVERY - Host ms-be2039 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms
[16:57:44] RECOVERY - Host ms-be2026 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms
[16:57:44] RECOVERY - Host cp2024 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms
[16:57:45] RECOVERY - Host ms-be2025 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
[16:58:03] Operations, MW-1.30-release-notes, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3364132 (Benoit_Rochon) Hello! I noticed that Visual Editor is not working. If you try to edit a page that way, it's b...
[16:58:03] RECOVERY - configured eth on lvs2003 is OK: OK - interfaces up
[16:58:03] RECOVERY - configured eth on lvs2001 is OK: OK - interfaces up
[16:58:03] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[16:58:03] RECOVERY - configured eth on lvs2002 is OK: OK - interfaces up
[16:58:53] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[16:59:34] upgrade done
[16:59:53] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:00:04] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 2 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/etc/swift/swift-drive-audit.conf],File[/usr/bin/swift-drive-audit],File[/home/midom],File[/home/yuvipanda]
[17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T1700). Please do the needful.
[17:00:23] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:01:12] parsoid deploy in a bit.
[17:01:13] !log Ran maintain-meta_p --all-databases on labsdb1003
[17:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:51] Operations, Electron-PDFs, Services, Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3364153 (GWicke) Marko's new version of the patch actually checks whether the service is responsive, and only restarts it when n...
[17:01:57] Operations, ops-eqiad, Labs, Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364154 (RobH) Ok, further updates. I'll write the partman recipe and get the OS isntallation done on these. However, all of these hosts will need their...
[17:02:23] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[17:02:50] Operations, ops-eqiad, Labs, Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364156 (RobH)
[17:03:16] Operations, netops, Patch-For-Review: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274#3364158 (ayounsi) Upgrade done. Took a bit longer than expected ~1h45min. But process was smooth. Full logs on P5597
[17:03:33] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[17:03:33] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[17:03:53] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[17:03:53] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[17:04:28] (PS1) Ema: Revert "Route around codfw for asw-d-codfw switch upgrade" [puppet] - https://gerrit.wikimedia.org/r/360381
[17:04:40] Operations, ops-eqiad, Labs, Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3364159 (chasemp) @Cmjohnson @RobH thanks guys, post install assign to me and I'll take care of it.
[17:04:47] !log re-enable igmp-snooping on asw-d-codfw
[17:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:03] RECOVERY - puppet last run on ganeti1007 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[17:08:04] Operations, Discovery, Maps, monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#3364165 (Dzahn) I think so, yes. We already have the "is deprecated" message up on Ganglia web UI since a while and it's what Bblack said above, and i don't know much more than rest...
[17:10:12] mobrovac: will do the eventbus reboots tomorrow!
[17:10:37] Operations, Domains, Traffic, Wikimedia-Site-requests, HTTPS: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3249552 (Dzahn) I think it might be worth this _one and only_ exception to add this domain to the main cert. Of course we don't want to do that with...
[17:13:02] Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3364237 (debt) a:EBernhardson
[17:13:50] Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3317363 (debt) a:EBernhardson>dcausse
[17:14:21] (CR) Smalyshev: "@hoo I thought on phabricator you said "I would prefer to add https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2"? I can" [puppet] - https://gerrit.wikimedia.org/r/358783 (https://phabricator.wikimedia.org/T164783) (owner: Smalyshev)
[17:15:17] (PS1) Ema: Revert "Depool codfw for asw-d-codfw upgrade" [dns] - https://gerrit.wikimedia.org/r/360386
[17:16:11] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 4
[17:16:38] !log restart redis-instance-tcp_6380.service on rdb2004 to force sync with its master
[17:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:11] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[17:18:31] PROBLEM - Host elastic1021 is DOWN: PING CRITICAL - Packet loss = 100%
[17:21:36] !log restart redis-instance-tcp_6380.service on rdb2003 to force sync with its master
[17:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:11] RECOVERY - Host elastic1021 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[17:22:48] Operations, ops-codfw, Labs, Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3364338 (Papaul) In the process of troubleshooting the pxe boot issue on this system, I setup a test dhcp/dns/tftp server on my laptop and boot the server to it...
[17:23:02] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[17:24:02] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8945971 keys, up 3 minutes 3 seconds - replication_delay is 0
[17:24:07] thank you
[17:29:47] !log running a script in tmux on rdb200[34] called "check" to dump periodically LLEN enwiki:jobqueue:enqueue:l-unclaimed
[17:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:11] (PS1) Dzahn: fix DNS for labtestpuppetmaster, 1002 != 2001 [dns] - https://gerrit.wikimedia.org/r/360388 (https://phabricator.wikimedia.org/T167157)
[17:30:44] it runs every 5 mins, I'd need to verify how soon the slaves get out of sync after getting restarted
[17:32:40] Operations, MW-1.30-release-notes, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3364428 (Reedy) I'm guessing based on https://wikitech.wikimedia.org/wiki/Add_a_wiki#Parsoid that https://gerrit.wikim...
[17:34:05] (CR) Dzahn: [C: 2] fix DNS for labtestpuppetmaster, 1002 != 2001 [dns] - https://gerrit.wikimedia.org/r/360388 (https://phabricator.wikimedia.org/T167157) (owner: Dzahn)
[17:34:29] !log arlolra@tin Started deploy [parsoid/deploy@4b60bf9]: Updating Parsoid to e2e2b5f6
[17:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:01] Operations, Electron-PDFs, Services, Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3364442 (GWicke) I am also wondering if there were any hangs during normal operation (not after a manual restart) before the Ele...
[17:37:32] Operations, ops-codfw, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3319424 (Dzahn) >>! In T167157#3364338, @Papaul wrote: > Jun 20 17:21:43 install2002 dhcpd[11106]: DHCPDISCOVER from 30:e1:71:63:5e:5c via...
[17:37:45] Operations, ops-codfw, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3364465 (Papaul) Daniel find out that for 208.80.153.108 reverse lookup = 2001 and forward lookup = 1002 He fixed it and will try inst...
[17:40:34] !log mwreleaeses1001 - puppet node clean, puppet node deactivate - was reinstalled as releases1001
[17:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:55] (PS1) RobH: labvirt101[5-8] install params [puppet] - https://gerrit.wikimedia.org/r/360391
[17:42:26] !log arlolra@tin Finished deploy [parsoid/deploy@4b60bf9]: Updating Parsoid to e2e2b5f6 (duration: 07m 57s)
[17:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:01] Operations, Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3364503 (EBernhardson)
[17:43:03] !log RT - ununpentium - upgradeed rt4-db-mysql
[17:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:13] Operations, Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3364516 (EBernhardson) @EBjune This will require your approval
[17:43:36] (CR) RobH: [C: 2] labvirt101[5-8] install params [puppet] - https://gerrit.wikimedia.org/r/360391 (owner: RobH)
[17:45:03] (CR) Ayounsi: [C: 2] Revert "Depool codfw for asw-d-codfw upgrade" [dns] - https://gerrit.wikimedia.org/r/360386 (owner: Ema)
[17:45:10] (PS2) Ayounsi: Revert "Depool codfw for asw-d-codfw upgrade" [dns] - https://gerrit.wikimedia.org/r/360386 (owner: Ema)
[17:45:17] !log deploying wdqs blazegraph and GUI updates
[17:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:23] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3364542 (Cmjohnson)
[17:46:38] !log gehel@tin Started deploy [wdqs/wdqs@b60d224]: (no justification provided)
[17:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:27] !log repool codfw - T167274
[17:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:38] T167274: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274
[17:48:20] !log gehel@tin Finished deploy [wdqs/wdqs@b60d224]: (no justification provided) (duration: 01m 41s)
[17:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:37] SMalyshev: deploy completed, tests are green
[17:49:05] Operations, netops, Patch-For-Review: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274#3364565 (ayounsi) Open>Resolved
[17:49:52] !log Since arlolra noticed some unexpected warnings from the canaries, the Parsoid deploy was rolled back, so Parsoid was not updated to e2e2b5f6 (contrary to what scap said above).
[17:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:51] (PS2) Ema: Revert "Route around codfw for asw-d-codfw switch upgrade" [puppet] - https://gerrit.wikimedia.org/r/360381 (https://phabricator.wikimedia.org/T167274)
[17:52:59] !log cobalt (gerrit) - re-enabling puppet, running it. nothing should change, the system unit file mentioned in T168360#3362314 does not get installed by puppet, it comes from the deb
[17:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:09] T168360: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360
[17:53:25] gehel: thanks
[17:59:35] Operations, Gerrit, Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3364632 (Dzahn) Thanks joe! So.. that systemd unit file is installed from the .deb, not by puppet. I said we should remove i...
[18:02:57] !log ssh labsdb101[0|1].eqiad.wmnet 'sudo maintain-meta_p --all-databases --debug'
[18:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:20] Operations, Gerrit, Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3364645 (Dzahn) I would say let's delete the unit file that is installed now, remove it from package the next time we build a...
[18:04:44] (PS1) Cmjohnson: Adding mgmt dns entries for ores1001-9 [dns] - https://gerrit.wikimedia.org/r/360395
[18:05:43] (PS3) Ema: Revert "Route around codfw for asw-d-codfw switch upgrade" [puppet] - https://gerrit.wikimedia.org/r/360381 (https://phabricator.wikimedia.org/T167274)
[18:05:49] (CR) Ema: [V: 2 C: 2] Revert "Route around codfw for asw-d-codfw switch upgrade" [puppet] - https://gerrit.wikimedia.org/r/360381 (https://phabricator.wikimedia.org/T167274) (owner: Ema)
[18:06:30] (PS1) Yuvipanda: tools: Add paws nodes to clush [puppet] - https://gerrit.wikimedia.org/r/360397 (https://phabricator.wikimedia.org/T167086)
[18:06:33] !log route ulsfo back to codfw T167274
[18:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:43] T167274: codfw row D switch upgrade - https://phabricator.wikimedia.org/T167274
[18:08:44] Operations, Gerrit, Patch-For-Review: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") - https://phabricator.wikimedia.org/T168360#3364664 (Paladox) Well i found the init script adds 1024 with core.packedGitOpenFiles so basically for us it's 1024 + 6000 bu...
[18:09:39] !log netmon1002 - arm keyholder with rancid key
[18:09:41] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys.
[18:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:01] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 9508
[18:14:38] nice ^
[18:15:25] Operations, cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3364701 (MoritzMuehlenhoff)
[18:26:01] Operations, Discovery, Traffic, Wikidata, and 2 others: runUpdate.sh script in wikidata stand-alone has abruptly started incurring numerous 429 errors. - https://phabricator.wikimedia.org/T168019#3364719 (Lisp.hippie) Everything is running smoothly on our end. Thanks @ema and @Smalyshev !
[18:35:23] Operations, DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#3364802 (kaldari) @Marostegui: Thanks for the info about the back-up. Might be useful data for some wiki archeologist one day :)
[18:35:32] (PS2) Ottomata: Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369)
[18:36:04] (PS3) Ottomata: Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369)
[18:41:55] (CR) jerkins-bot: [V: -1] Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369) (owner: Ottomata)
[18:45:42] (PS4) Ottomata: Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369)
[18:51:08] Operations, cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3364905 (MoritzMuehlenhoff) Updated kernels have been installed (plus the related base libraries/services).
[18:52:42] (PS1) Framawiki: Planet-fr: Replace the RAW feed by the new one [puppet] - https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617)
[18:52:59] (PS2) Cmjohnson: Adding mgmt dns entries for ores1001-9 [dns] - https://gerrit.wikimedia.org/r/360395
[18:53:56] (CR) Cmjohnson: [C: 2] Adding mgmt dns entries for ores1001-9 [dns] - https://gerrit.wikimedia.org/r/360395 (owner: Cmjohnson)
[19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T1900).
[19:01:10] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3364959 (GWicke) >>! In T66214#3256693, @Tgr wrote: > We'll also need a way to display old versions of images. Clients can encounter old versions wit...
[19:10:27] Operations, monitoring, Interactive-Sprint, Maps (Kartotherian), Technical-Debt: Geoshape and geoline subservices need monitoring - https://phabricator.wikimedia.org/T166776#3365018 (debt) Open>Resolved Moving to done - this has been merged.
[19:12:02] Operations, LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#3365029 (RobH) I'm not 100% sure we need to run that same *.corp.wikimedia.org cert. I don't see any private key file for star.corp.wikimedia.org.key in the private ops re...
[19:12:24] Operations, Incident-20150423-Commons, MediaWiki-API, Parsoid, and 7 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#3365034 (GWicke) @Joe, has this been fixed with 3.18?
[19:12:34] Operations, Traffic, HTTPS, LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#3365036 (Framawiki)
[19:14:36] Operations, Traffic, HTTPS, LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#3365053 (RobH) This is for ldap use, not https, not sure #traffic or #https or #traffic belong.
[19:17:27] !log Prepping 1.30.0-wmf.6 - T167535
[19:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:39] T167535: MW-1.30.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T167535
[19:20:51] (PS2) Andrew Bogott: nodepool: lower rate of queries from 6 to 5 [puppet] - https://gerrit.wikimedia.org/r/358601 (https://phabricator.wikimedia.org/T167803) (owner: Hashar)
[19:22:27] (CR) Andrew Bogott: [C: 2] nodepool: lower rate of queries from 6 to 5 [puppet] - https://gerrit.wikimedia.org/r/358601 (https://phabricator.wikimedia.org/T167803) (owner: Hashar)
[19:23:43] (CR) Ottomata: "Looking good:" [puppet] - https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369) (owner: Ottomata)
[19:23:47] (PS5) Ottomata: Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369)
[19:23:49] (CR) Ottomata: [V: 2 C: 2] Insert select eventbus generated event topics into eventlogging MySQL database [puppet] - https://gerrit.wikimedia.org/r/359212 (https://phabricator.wikimedia.org/T150369) (owner: Ottomata)
[19:26:46] Operations, Traffic, netops: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3365105 (ayounsi)
[19:28:03] jouncebot: next
[19:28:03] In 2 hour(s) and 31 minute(s): Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T2200)
[19:29:44] (PS1) Ottomata: Don't fail if userAgent not in event [puppet] - https://gerrit.wikimedia.org/r/360407
[19:30:25] (CR) Ottomata: [V: 2 C: 2] Don't fail if userAgent not in event [puppet] - https://gerrit.wikimedia.org/r/360407 (owner: Ottomata)
[19:32:41] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.91 seconds
[19:32:42] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.95 seconds
[19:32:42] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.60 seconds
[19:32:42] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.41 seconds
[19:33:11] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 332.58 seconds
[19:33:11] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.91 seconds
[19:35:17] Operations, LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#3365149 (Framawiki) Oh, true, sory
[19:37:23] Operations, Traffic, netops: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3365153 (ayounsi)
[19:43:11] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:44:01] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 75791 bytes in 0.300 second response time
[19:48:10] Operations, Analytics, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3365183 (GWicke) >>! In T118365#3349563, @Nuria wrote: >>which matches metrics end points explicitly limited at 100/s per client IP. > > mmm... looking at pageview API dashb...
[19:48:54] twentyafterfour: is the train still going out?
[19:55:10] Operations, Traffic, netops: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3365193 (ayounsi)
[19:58:00] (PS1) Ottomata: Don't attempt to examine max timestamp of eventlogging table if it doesn't have a timestamp field [puppet] - https://gerrit.wikimedia.org/r/360411 (https://phabricator.wikimedia.org/T150369)
[19:59:00] (CR) Ottomata: [V: 2 C: 2] Don't attempt to examine max timestamp of eventlogging table if it doesn't have a timestamp field [puppet] - https://gerrit.wikimedia.org/r/360411 (https://phabricator.wikimedia.org/T150369) (owner: Ottomata)
[20:02:46] (Draft1) Paladox: Gerrit: Makes sure review_site/lib exists [puppet] - https://gerrit.wikimedia.org/r/360412
[20:02:49] (PS2) Paladox: Gerrit: Makes sure review_site/lib exists [puppet] - https://gerrit.wikimedia.org/r/360412
[20:03:20] legoktm: yes I'll be deploying shortly
[20:03:52] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:04:46] Operations, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Kanban): CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3365211 (hashar) I have rebased the serie of patches for operations/puppet.git that slightly enhance the...
[20:05:46] Operations, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (Kanban), and 2 others: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#3365212 (hashar) >>! In T166888#3360863, @hashar wrote: > >>>! In T166888#3333018, @faidon...
[20:07:15] Operations, ops-codfw, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365221 (Papaul)
[20:08:09] !log twentyafterfour@tin Started scap: sync 1.30.0-wmf.6 refs T167535
[20:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:19] T167535: MW-1.30.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T167535
[20:12:40] (PS3) Paladox: Gerrit: Makes sure review_site/lib exists [puppet] - https://gerrit.wikimedia.org/r/360412
[20:17:31] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging
[20:27:32] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is OK: Files ownership is ok.
[20:30:02] (PS1) Hashar: contint1001: upgrade git on zuul mergers [puppet] - https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086)
[20:30:05] Operations, DBA, Wikimedia-Site-requests, Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3365315 (Dereckson)
[20:32:01] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[20:33:38] Operations, Mail: Get mail relay out of Yahoo! blacklist: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T58414#3365332 (herron)
[20:34:36] (CR) Dereckson: [C: -1] "Why bump this feed to the head of the list?" [puppet] - https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) (owner: Framawiki)
[20:35:14] (CR) Dereckson: [C: -1] "URL is correct and works fine." [puppet] - https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) (owner: Framawiki)
[20:37:27] !log twentyafterfour@tin Finished scap: sync 1.30.0-wmf.6 refs T167535 (duration: 29m 16s)
[20:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:36] T167535: MW-1.30.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T167535
[20:39:38] (PS1) Urbanecm: Add two lines to NamespacesAliases for zh_classical [mediawiki-config] - https://gerrit.wikimedia.org/r/360450
[20:40:02] (PS2) Urbanecm: Add two lines to NamespacesAliases for zh_classical [mediawiki-config] - https://gerrit.wikimedia.org/r/360450 (https://phabricator.wikimedia.org/T168422)
[20:40:35] (PS1) 20after4: group0 wikis to 1.30.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/360455
[20:40:37] (CR) 20after4: [C: 2] group0 wikis to 1.30.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/360455 (owner: 20after4)
[20:42:34] (CR) Dzahn: "i think we should remove the file from the package and add it in puppet, but there are different opinions where it should be and 20000 is " [debs/gerrit] - https://gerrit.wikimedia.org/r/360312 (https://phabricator.wikimedia.org/T168360) (owner: Paladox)
[20:44:50] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.6/includes/changes/EnhancedChangesList.php: deploy bad7bde87dc945ba7e1c307420ccbb2419ca90c9 refs T167535 (duration: 00m 53s)
[20:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:00] T167535: MW-1.30.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T167535
[20:45:21] PROBLEM - Host elastic1021 is DOWN: PING CRITICAL - Packet loss = 100%
[20:45:23] (PS3) Paladox: Fix systemd script to use a higher LimitNOFile value [debs/gerrit] - https://gerrit.wikimedia.org/r/360312 (https://phabricator.wikimedia.org/T168360)
[20:46:21] RECOVERY - Host elastic1021 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[20:49:06] (CR) Dzahn: [C: 1] "+0.75" [debs/gerrit] - https://gerrit.wikimedia.org/r/360312 (https://phabricator.wikimedia.org/T168360) (owner: Paladox)
[20:50:03] (CR) jerkins-bot: [V: -1] Add two lines to NamespacesAliases for zh_classical [mediawiki-config] - https://gerrit.wikimedia.org/r/360450 (https://phabricator.wikimedia.org/T168422) (owner: Urbanecm)
[20:51:26] (CR) Framawiki: "You are right, I'll change the position." [puppet] - https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) (owner: Framawiki)
[20:51:29] (CR) Dzahn: [C: 1] "if you guys think this is fine... i can build and upload this again" [debs/gerrit] - https://gerrit.wikimedia.org/r/360312 (https://phabricator.wikimedia.org/T168360) (owner: Paladox)
[20:51:38] (Merged) jenkins-bot: group0 wikis to 1.30.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/360455 (owner: 20after4)
[20:51:52] (CR) jenkins-bot: group0 wikis to 1.30.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/360455 (owner: 20after4)
[20:52:55] (CR) Dzahn: "so.. this directory was created by the package in the past.. so this was never an issue, but at some point the package stopped creating th" [puppet] - https://gerrit.wikimedia.org/r/360412 (owner: Paladox)
[20:53:28] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: Group0 to 1.30.0-wmf.6 refs T167535
[20:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:38] T167535: MW-1.30.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T167535
[20:54:04] (CR) Paladox: "> so.. this directory was created by the package in the past.. so" [puppet] - https://gerrit.wikimedia.org/r/360412 (owner: Paladox)
[20:54:07] (PS3) Urbanecm: Add two lines to NamespacesAliases for zh_classical [mediawiki-config] - https://gerrit.wikimedia.org/r/360450 (https://phabricator.wikimedia.org/T168422)
[20:54:41] !log Finished train deployment for group0, train will resume tomorrow as scheduled.
[20:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:09] (CR) Dzahn: "aha! thanks, this was helpful and explains why this is now needed. looks good to me, just wondering if there is a reason to drop "package " [puppet] - https://gerrit.wikimedia.org/r/360412 (owner: Paladox)
[20:57:11] (CR) Dzahn: [C: 1] "ignore my question, of course because the package doesn't add it anymore, yea" [puppet] - https://gerrit.wikimedia.org/r/360412 (owner: Paladox)
[20:57:46] Operations, ops-codfw, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365385 (Papaul)
[20:58:58] (PS2) Legoktm: Deploy Linter to all wikis (try #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/358887 (https://phabricator.wikimedia.org/T148609)
[20:59:54] (CR) Legoktm: [C: 2] Deploy Linter to all wikis (try #2) [mediawiki-config] - https://gerrit.wikimedia.org/r/358887 (https://phabricator.wikimedia.org/T148609) (owner: Legoktm)
[21:03:41] (PS4) Mobrovac: Update recommendation-api module and role [puppet] - https://gerrit.wikimedia.org/r/358026 (https://phabricator.wikimedia.org/T167113) (owner: Nschaaf)
[21:05:24] (CR) jerkins-bot: [V: -1] Add two lines to NamespacesAliases for zh_classical [mediawiki-config] - https://gerrit.wikimedia.org/r/360450 (https://phabricator.wikimedia.org/T168422) (owner: Urbanecm)
[21:11:51] (CR) Nschaaf: [C: 1] Update recommendation-api module and role [puppet] - https://gerrit.wikimedia.org/r/358026
(https://phabricator.wikimedia.org/T167113) (owner: 10Nschaaf) [21:13:10] (03Merged) 10jenkins-bot: Deploy Linter to all wikis (try #2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358887 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [21:13:21] !log labtestpuppetmaster2001 - install-console, activate puppet, sign cert, initial puppet run, add salt key (T167157) [21:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:31] T167157: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157 [21:13:38] (03PS5) 10Mobrovac: Update recommendation-api module and role [puppet] - 10https://gerrit.wikimedia.org/r/358026 (https://phabricator.wikimedia.org/T167113) (owner: 10Nschaaf) [21:15:22] (03PS1) 10BryanDavis: Move ukwikimedia to deleted.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360564 (https://phabricator.wikimedia.org/T168436) [21:15:54] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3365455 (10mobrovac) >>! In T165105#3347539, @Mvolz wrote: >>>! In T165105#3347538, @Samwalton9 wrote: >> Ye... 
[21:17:01] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/6825/" [puppet] - 10https://gerrit.wikimedia.org/r/358026 (https://phabricator.wikimedia.org/T167113) (owner: 10Nschaaf) [21:17:24] !log rebooting labvirt1014 as practice for tomorrow's security reboots [21:17:25] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Deploy Linter to all wikis (try #2) - T148609 (duration: 00m 44s) [21:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:42] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [21:22:04] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365462 (10Papaul) [21:22:35] (03CR) 10jenkins-bot: Deploy Linter to all wikis (try #2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358887 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [21:26:19] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3365467 (10Papaul) @Andrew this is complete you can take over from here. Thanks. [21:29:04] !log arlolra@tin Started restart [parsoid/deploy@4b60bf9]: (no justification provided) [21:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:36] i see b&w logo (which is actually smaller than usual) and b&w favicon on wmf.org site. is that on purpose? 
[21:30:53] (03CR) 10Dzahn: [C: 031] Gerrit: Makes sure review_site/lib exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/360412 (owner: 10Paladox) [21:30:57] (03PS4) 10Paladox: Gerrit: Makes sure review_site/lib exists [puppet] - 10https://gerrit.wikimedia.org/r/360412 [21:31:34] robh: ^^ [21:32:17] on https://wikimediafoundation.org/wiki/Home ? [21:32:55] i dont see anything but the normal items, i just launched a cleared session in chrome [21:33:57] robh: any page on that site - the logo on top left is smaller and b&w in ff. the favicon is b&w also [21:34:16] it all looks normal to me [21:34:29] Anyone else able to reproduce Danny_B's error? [21:34:47] 10Operations, 10Traffic, 10netops: codfw row A switch upgrade - https://phabricator.wikimedia.org/T168462#3365472 (10ayounsi) [21:34:56] the favicon for me is still in color in a clear google chrome session [21:35:21] robh: what do you see here? https://wikimediafoundation.org/static/images/project-logos/foundationwiki.png [21:35:31] a colored logo [21:35:41] i see a black/white logo in Firefox [21:35:48] but i dont know why that is [21:35:48] huh, i see color [21:36:00] i know the wmf logo changed though to black and white recently [21:36:03] !log legoktm@tin Synchronized wmf-config: touch (duration: 00m 45s) [21:36:07] and https://wikimediafoundation.org/static/favicon/wmf.ico [21:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:13] https://github.com/wikimedia/operations-mediawiki-config/commit/94703bf5ab3902b267d5942e9c871d05b7a64305 [21:36:14] (03PS1) 10Papaul: DNS: Add DNS entries for new spare systems [dns] - 10https://gerrit.wikimedia.org/r/360570 [21:36:16] yeah all of those are in color for me. 
[21:36:18] in Chromium it looks different again [21:36:19] in both FF and chrome [21:36:25] it is also bw but with a transparent background [21:36:27] as opposed to white [21:36:41] well, it should be a transparent background no matter what, some browsers render it differently [21:36:53] in my ff i see the png and ico both black on transparent bg [21:36:56] The github commit I linked above is the reason [21:37:20] ok :) [21:37:54] seems like color logo = old then [21:38:26] possibly different things stuck in various caches [21:38:40] 10Operations, 10ops-codfw, 10hardware-requests: reclaim/decom tmh200[12] - https://phabricator.wikimedia.org/T168472#3365481 (10RobH) [21:38:56] Danny_B: so yeah, as Reedy points out its intentional [21:38:58] huh, where was the discussion about changing the logo? [21:39:04] its wikimedia foundation [21:39:14] they had some kind of release about the logo change a few months ago iirc? [21:39:25] 10Operations, 10ops-codfw, 10Patch-For-Review: Rack/setup codfw spare systems - https://phabricator.wikimedia.org/T167705#3365494 (10Papaul) [21:39:32] T144254 [21:39:34] T144254: Update instances of Wikimedia Foundation logo - https://phabricator.wikimedia.org/T144254 [21:39:48] Reedy haz all answers [21:40:17] Reedy: i mean on meta or something [21:40:26] * Reedy shrugs [21:40:26] where community was discussing that [21:40:43] Ask on the task? [21:43:18] https://wikimediafoundation.org/wiki/Visual_identity_guidelines [21:43:58] ^ the logo of the WMF itself falls under that anyways [21:44:16] https://meta.wikimedia.org/wiki/Brand [21:44:32] I still dont get why we moved to black and white :( [21:44:36] "do not change the logo colors" heh [21:44:41] that fixes that :) [21:44:43] neither do i [21:44:55] design team? 
[21:45:05] nor i do remember any public discussion about that [21:45:11] sounds like some folks thought it would be a good idea to harmonize [21:45:37] since on lot of media you end up with a single color ( typically on t-shirt/print to save cost of adding some extra colors) [21:45:47] i am not sure if "trademark the logo" and "publicly discuss the design" even works together [21:45:57] then we still have logos and derivatives using blue/green/red ;d [21:45:58] first time i've seen that on wmf site, i thought somebody from the staff died and this is sort of memorial [21:46:32] IIRC https://wikimediafoundation.org/wiki/Visual_identity_guidelines is the official guideline [21:46:58] which not that many people follow / care about [21:47:38] it's a trademark and property, probably need to ask legal too [21:48:07] weren't all logos liberated? [21:48:10] someone should start a thread on wmfall!!!!!! [21:48:29] hashar: go for it! [21:49:25] looks like Heather Walls ( https://wikimediafoundation.org/wiki/User:Heather_(WMF) ) did the change in August 2016 [21:49:32] do you want to change the logo of the mailing list? ask first :p [21:49:32] so she would know for sure [21:50:23] i still don't see a reason for the perfect logo to be changed [21:50:51] did coca cola or nivea ever changed their logo? no. and they are most valuable logos [21:50:57] No one in here is going to know the answer to that ;] [21:51:16] yeah, sorry... [21:51:41] id suggest the phab task though, phab seems to get results better than mailing lists imo [21:51:45] for questions [21:51:56] rfc on meta [21:52:20] yeah phab would be a good way to drive the switch [21:53:24] anyway bd time! [21:53:29] logo on phab is white [21:53:32] * mutante hides [21:54:01] well, actually.... ;) (big brands do change their logo semi frequently) [21:54:41] diet coke can has changed 4 times in my lifetime that i recall. 
[21:54:51] diabetics recall this kind of thing ;D [21:54:56] ;) [21:55:01] "Coca-Cola Light" for Germany [21:55:12] yea its less insulting than diet coke [21:55:14] heh [21:55:17] hehe [21:55:34] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [21:55:39] but the coca cola and the wave remains the same [21:56:24] we can find examples to support either claim (coca cola vs starbucks, for instance), there is no one right way. Let's move on from this topic of what is "right" with logos [21:56:24] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [21:56:28] wmf logo did not only change to b&w but also changed the font used [21:57:29] there is usually both, a color and a b/w version of a logo (without changing the font though) [21:57:51] and where you use which depends on context [22:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T2200). Please do the needful. [22:00:12] they will both be trademarked.. like the b/w version of the coca-cola wave or this https://en.wikipedia.org/wiki/Coca-Cola_Life [22:01:34] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [22:02:56] (03CR) 10Dzahn: [C: 032] Gerrit: Makes sure review_site/lib exists [puppet] - 10https://gerrit.wikimedia.org/r/360412 (owner: 10Paladox) [22:03:10] thanks [22:03:26] confirming that on gerrit2001 first [22:03:34] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [22:04:20] mode: mode changed '0755' to '0555' [22:04:52] now cobalt.. 
no change [22:05:11] gerrit2001 actually wasn't identical [22:05:22] all good [22:09:44] hi [22:16:10] 10Operations, 10Scoring-platform-team-Backlog: Keep wmflabs scoring boxes up-to-date - https://phabricator.wikimedia.org/T168478#3365588 (10awight) [22:16:22] (03PS5) 10Aude: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) [22:16:27] (03CR) 10Aude: [C: 032] Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [22:17:45] (03Merged) 10jenkins-bot: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [22:17:55] (03CR) 10jenkins-bot: Enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360336 (https://phabricator.wikimedia.org/T158323) (owner: 10Aude) [22:21:55] testing on mwdebug [22:29:09] 10Operations, 10DBA, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3365630 (10MZMcBride) Sorry, I misunderstood the scope of this task. I thought this task was about Wikimedia Labs using row-based replication, not Wikimedia production. I think I'm actually lo... [22:30:12] still testing/ adding wikibase db table [22:39:15] updating the sites tables [22:41:35] 10Operations, 10DBA, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#1542524 (10bd808) >>! In T109179#3365630, @MZMcBride wrote: > In the context of Wikimedia Labs, the word testing is confusing to me. Isn't all of Labs for testing? It's nice to hear that the d... 
[22:48:41] 10Operations, 10MW-1.30-release-notes, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3342848 (10jhsoby) That seems to work fine now! [22:49:17] !log created wbc_entity_usage table and updated sites table on wiktionary wikis [22:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:31] alright [22:52:44] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [22:53:04] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 57.09 seconds [22:53:04] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 57.24 seconds [22:53:04] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 58.40 seconds [22:53:34] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 54.27 seconds [22:53:45] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 50.45 seconds [22:54:24] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [22:54:44] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [22:55:34] ah, good there's nothing in swat [22:55:41] hopefully won't be too much longer though [22:56:34] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [22:57:44] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [22:59:35] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable Wikibase (phase 1) on Wiktionary wikis (duration: 00m 44s) [22:59:43] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170620T2300). [23:00:44] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [23:02:34] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [23:03:55] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Remove temp wiktionary site link settings for test wikidata (duration: 00m 43s) [23:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:47] now just removing unneeded test/beta settings [23:05:15] !log aude@tin Synchronized wmf-config/Wikibase-labs.php: Remove temp wiktionary site link settings (duration: 00m 44s) [23:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:14] !log aude@tin Synchronized wmf-config/InitialiseSettings-labs.php: Remove temp wiktionary site link settings (duration: 00m 43s) [23:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:32] ok, done [23:07:30] don't see anything in the logs related to this [23:07:34] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:08:39] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3365759 (10RobH) p:05High>03Normal [23:09:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [23:11:27] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is 
CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [23:13:26] (03PS2) 10Dzahn: DNS: Add DNS entries for new spare systems [dns] - 10https://gerrit.wikimedia.org/r/360570 (owner: 10Papaul) [23:17:10] 10Operations, 10Ops-Access-Requests: Access request for Daniel Worley to analytics / hadoop - https://phabricator.wikimedia.org/T168439#3364503 (10Dzahn) Hi, here are the existing access groups and their description: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups Does one of them see... [23:18:01] (03CR) 10Dzahn: [C: 032] DNS: Add DNS entries for new spare systems [dns] - 10https://gerrit.wikimedia.org/r/360570 (owner: 10Papaul) [23:20:43] (03CR) 10Dzahn: "what is including what here? wikipedia includes the wordpress feed?" [puppet] - 10https://gerrit.wikimedia.org/r/360403 (https://phabricator.wikimedia.org/T167617) (owner: 10Framawiki) [23:21:21] (03CR) 10Dzahn: [C: 031] "@Giuseppe i think this is better now compared to when you added the -2" [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [23:24:27] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:35:37] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:37:27] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:38:29] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3365877 (10RobH) a:05RobH>03Cmjohnson Chris: Please wire up eth1 on these systems and label their ports on the switch. Then you or I can take a look a... 
[23:38:51] (03CR) 10Dzahn: [C: 031] "cool, i was wondering about that on netmon1002 - i will NOT move it" [puppet] - 10https://gerrit.wikimedia.org/r/351260 (owner: 10Alexandros Kosiaris) [23:51:05] 10Operations, 10HyperSwitch, 10RESTBase-API, 10Traffic, 10Services (next): Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/ - https://phabricator.wikimedia.org/T167972#3351561 (10Pchelolo) When I'm trying to implement this internally within #restbase I hit into T168481 so that one sho...