[00:02:41] (PS1) Alex Monk: horizon: change wikitech help URLs to use https [puppet] - https://gerrit.wikimedia.org/r/309704
[00:05:52] (PS1) Alex Monk: Remove wikitech references from ldapconfig [puppet] - https://gerrit.wikimedia.org/r/309705
[00:06:54] (PS1) Alex Monk: labtest hiera: use labtestwikitech, not wikitech [puppet] - https://gerrit.wikimedia.org/r/309706
[00:09:11] (PS1) Alex Monk: dns-floating-ip-updater: use python's ipaddress class to determine PTR FQDNs for IPs [puppet] - https://gerrit.wikimedia.org/r/309708
[00:14:27] (PS1) Alex Monk: openstack: Update monitor_labs_salt_keys.py for new Nova API version [puppet] - https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607)
[00:15:23] I think that's twelve patches now
[00:15:34] I should stop uploading stuff
[00:16:04] git stash list is empty now though
[00:18:28] get it off your local harddrive :)
[00:19:05] I probably still have a ton of branches with one local commit with the message "WIP"
[00:36:02] (CR) RobH: [C: 2] robh on vacation, remove from paging [puppet] - https://gerrit.wikimedia.org/r/309694 (owner: RobH)
[00:36:44] maybe I should write a script to go through the branches, look at the change-ids and print any not already in gerrit
[00:36:52] ... then take that script, commit it and upload it to gerrit :D
[00:40:48] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints]
[00:50:17] (PS3) Krinkle: contint: Remove 'integration/phpcs' deployment source [puppet] - https://gerrit.wikimedia.org/r/301523
[01:01:31] git branch | xargs -n 1 echo > branches
[01:01:31] git branch | xargs -n 1 echo | xargs -n 1 git show --stat | grep Change-Id | grep -o "I[a-f0-9]*$"| xargs -n 1 ssh gerrit gerrit query --format=json | grep stats > rets
[01:01:52] then in python:
[01:02:20] import json
[01:02:20] for branch, json_ret in zip(open('branches').read().splitlines(), open('rets').read().splitlines()):
[01:02:20]     if json.loads(json_ret)['rowCount'] == 0:
[01:02:20]         print('git show --stat ' + branch)
[01:05:02] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:16:25] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.188 second response time
[01:50:22] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1800.809776 Seconds
[01:52:43] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 130.190731 Seconds
[02:40:23] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 17m 53s)
[02:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:46:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Sep 10 02:46:34 UTC 2016 (duration 6m 11s)
[02:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:16] Operations, Performance-Team, Traffic, Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2625077 (AlexMonk-WMF)
[02:57:20] Operations, Pywikibot-core, Traffic, HTTPS, Patch-For-Review: rcstream support defaults to stream.wikimedia.org:80 - https://phabricator.wikimedia.org/T145244#2625076
(AlexMonk-WMF) Open>Resolved
[03:00:49] Operations, Performance-Team, Traffic, Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2625078 (AlexMonk-WMF) That's at least one bot in tools fixed. Can you filter those access logs down to labs entries only (208.80.155.128 - 208.80...
[03:48:12] (CR) 20after4: [C: 1] l10nupdate: aquire scap lock before changing files [puppet] - https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: BryanDavis)
[03:49:50] (CR) 20after4: ""Disable PHP always_populate_raw_post_data"" [puppet] - https://gerrit.wikimedia.org/r/309214 (owner: 20after4)
[03:49:58] (CR) 20after4: [C: 1] phabricator php.ini: set always_populate_raw_post_data = -1 [puppet] - https://gerrit.wikimedia.org/r/309214 (owner: 20after4)
[04:54:27] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.051 second response time
[09:13:42] Operations, Traffic: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625236 (Glaisher)
[09:15:18] Operations, Traffic: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625248 (Glaisher)
[09:15:34] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625250 (Peachey88) p:Triage>High @Glaisher: If the user, or another one experiencing the same issue are still on IRC, could you please ask them to comple...
[09:15:54] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625253 (Glaisher) p:High>Unbreak!
[09:16:10] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625256 (Glaisher) p:Unbreak!>High
[09:16:18] p858snake: already asked them to add it
[09:19:31] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625236 (Cadar) Hi everybody, and thanks for all the help! Here is the ping: Pinging en.wikipedia.org [91.198.174.192] with 32 bytes of data: Request timed out....
[09:21:25] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625272 (Cadar) Here are the details of the tracert: Tracing route to en.wikipedia.org [91.198.174.192] over a maximum of 30 hops: 1 * * *...
[09:24:29] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625273 (Cadar) If you need any other information, let me know. I am currently connecting to Wikpedia, Wikimedia and Phabricator via an IP hiding service based in...
[09:39:47] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625275 (Cadar) My own IP address is currently 41.144.175.* so I think the IP range might be larger than the one officially published. I'm trying to track down a...
[09:55:17] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625281 (faidon) Telkom is announcing their routes directly to us over AMS-IX via the route servers, not a direct peering. However, it seems that they're blackhol...
[09:55:18] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2500
[10:00:10] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 901355 Threads: 1 Questions: 49535509 Slow queries: 4862 Opens: 2826 Flush tables: 2 Open tables: 536 Queries per second avg: 54.956 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:11:36] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625298 (faidon) I mailed Telkom's NOC. I'll follow up here when I hear back.
[10:16:12] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625300 (Cadar) Many thanks for the help! I have disabled my IP hider and seem to be able to connect fine now. I've also asked a couple of other people to check t...
[10:45:09] (CR) Ladsgroup: [C: 1] "I checked pywikibot tests and it seems this change doesn't break tests pywikibot does. Mostly because it makes a page in the test instance" [mediawiki-config] - https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: Aude)
[10:47:32] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:55:14] (CR) Merlijn van Deen: [C: 1] "lgtm (did not test)."
[puppet] - https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: BryanDavis)
[11:09:39] (CR) Merlijn van Deen: [C: 1] toollabs::proxy: Restrict to labs networks [puppet] - https://gerrit.wikimedia.org/r/309524 (owner: Muehlenhoff)
[11:12:23] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:13:47] Operations, Traffic, netops: Telkom/8ta (South Africa) users cannot connect to wikimedia sites - https://phabricator.wikimedia.org/T145270#2625346 (Cadar) New tracert results: Tracing route to wikipedia.org [91.198.174.192] over a maximum of 30 hops: 1 * * * Request timed ou...
[11:33:34] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.024 second response time
[11:47:47] ostriches: replace the page with {{Archived extension}} not delete
[11:55:47] oh right, Translate stuff
[11:55:56] someone else can fix those up
[12:32:56] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: Connection refused eevans More data corruption (see: http://paabricator.wikimedia.org/144826).
[12:34:57] Operations, Cassandra, Services: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826#2625394 (Eevans)
[12:36:42] !log T144826: Removing compaction rate limit, increasing compactor threads (from 10 to 20), and beginning scrub of local_group_wikipedia_T_parsoid_html.data (restbase2004-b.codfw.wmnet)
[12:36:45] T144826: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826
[12:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:43:59] Operations, Cassandra, Services: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826#2625412 (Eevans)
[14:57:45] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:25:14] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[15:32:47] (PS1) Dereckson: Women in Science (Vancouver, BCIT) throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/309737 (https://phabricator.wikimedia.org/T145253)
[15:42:09] There is an Unbreak now! for Wikimedia Commons, with a revert patch by MatmaRex. https://phabricator.wikimedia.org/T145228 https://gerrit.wikimedia.org/r/309736
[15:43:07] MatmaRex asserts 100-200 uploads were affected.
[15:44:09] greg-g: I'd advise a merge of this patch right now, as this is rather high impact
[15:44:21] MatmaRex: that only touches Special:Upload, and not the wizard?
[15:45:15] if no one is around, i think it could wait until monday morning. i'm only working today because i slacked off on friday.
[15:45:57] there are 100-200 instances in the logs a day, so presumably it means this many failed uploads per day, but it's impossible to tell whether those users were able to upload the files eventually
[15:46:33] yes, only affects Special:Upload (and only if you get a warning on the first upload attempt); all tools using the API are fine, including UploadWizard
[15:47:01] In the https://phabricator.wikimedia.org/T145272 duplicate, tonythomas had a failure
[15:48:35] i have not seen anyone complain about this until that bug report ^ so i think it can be deployed on monday. presumably newbie users use UploadWizard anyway, and advanced users rarely make mistakes that result in warnings on Special:Upload (and the warning-less path works fine). so this would mostly affect users of wikis other than Commons, which don't have UploadWizard.
[15:49:06] unfortunately the logging for fatals is really bad and it looks like we can't even tell which wiki the errors come from
[15:49:42] * Dereckson tries a reasonable scenario to repro.
[15:50:20] repo is to upload any file on special:upload with a bad filename, then correct it and try again
[15:50:28] repro*
[15:51:26] File extension ".png" does not match the detected MIME type of the file (image/jpeg).
[15:51:37] I tried that to trigger a warning.
[15:52:22] note that there's a difference between warning and errors; that one might be an error?
[15:52:34] Upload somewhat worked.
[15:52:43] But I got "The file you uploaded seems to be empty. This might be due to a typo in the filename. Please check whether you really want to upload this file."
[15:52:46] essentially, error = can't upload this file; warning = can upload this file after correcting filename or description
[15:52:53] and needed to reselect the file
[16:08:17] PROBLEM - mediawiki-installation DSH group on mw2077 is CRITICAL: Host mw2077 is not in mediawiki-installation dsh group
[16:23:54] PROBLEM - mediawiki-installation DSH group on mw2079 is CRITICAL: Host mw2079 is not in mediawiki-installation dsh group
[16:25:33] (PS1) Dereckson: Cleanup: squid.php → ReverseProxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148)
[16:26:22] (CR) Dereckson: "CachingProxy.php could also be a good name." [mediawiki-config] - https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148) (owner: Dereckson)
[16:31:04] (Abandoned) Hoo man: Specify contact groups in wdqs::monitor::services [puppet] - https://gerrit.wikimedia.org/r/308989 (owner: Hoo man)
[16:35:46] Dereckson: thank you for bringing this one up :( hope it gets fixed soon. Our Gerrit tutorial misses some essential screenshots :D it was not allowing me to upload anyway (some abusefilter something too came up when using the visual thing)
[16:35:55] PROBLEM - mediawiki-installation DSH group on mw2078 is CRITICAL: Host mw2078 is not in mediawiki-installation dsh group
[16:36:34] (PS1) Dereckson: Update noc.wikimedia.org dblist files [mediawiki-config] - https://gerrit.wikimedia.org/r/309743
[16:36:58] tonythomas: why do you want to upload files on mediawiki.org by the way? Upload them to Commons directly.
[16:37:33] Dereckson: commons was rejecting it showing up something like 'this contains some kind of inappropriate content for Wikimedia Commons' etc
[16:37:38] MatmaRex: fatalmonitor doesn't show the error in the last 5000, so yes it's low currently
[16:37:45] https://phabricator.wikimedia.org/T145272#2625573 was the file anyway
[16:42:43] https://commons.wikimedia.org/wiki/File:Gerrit_SSH_connection_test.png
[16:44:21] I didn't get an error message.
[16:47:14] Dereckson: hmm. Don't know. I was getting that all the way. adding this one to the Gerrit tutorial, anyway
[16:47:17] Thank you.
[16:48:39] you're welcome
[16:48:58] I created a Gerrit category if you see other screenshots Gerrit-related.
[16:57:34] Dereckson: great. There are a couple of them in https://www.mediawiki.org/wiki/Gerrit/Tutorial I guess
[16:58:01] yeah, feel free to add [[Category:Gerrit]] to them
[16:58:09] Dereckson: will do. Thank you!
[18:06:28] https://status.wikimedia.org/ - GeoIP lookup is down
[18:07:22] Dereckson ^^ as you're available do you have access to that?
[18:07:24] please
[18:07:36] he doesn't
[18:07:48] https://status.wikimedia.org/156490/GeoIP-lookup
[18:07:51] Oh ok
[18:07:54] Luke_, it's no longer running
[18:07:58] just needs the check removing
[18:08:40] ah, ok
[18:08:51] I have access to our VPS that proxies it, but not the backend
[18:09:33] I imagine ops have some webadmin panel password for it somewhere in their store
[19:25:53] Operations, Mail, OTRS, Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2625948 (AlexMonk-WMF) @wikimedia.nl and @wmnederland.nl won't work, they're not in files/exim/wikimedia_domains @wikilovesmonuments.eu and @wikilovesmonuments.nl should work
[19:43:43] Operations, Mail, OTRS, Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2625955 (Ciell) Both @wikimedia.nl and @wmnederland.nl belong to the Dutch Chapter WMNL, yes.
[19:58:12] Operations, DNS, Domains, Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2602033 (Ryan_Lane) >> But as our infrastructure code is completely open, you're welcome to submit patches that would do so! > > //*grin*// Yea...
[20:00:29] Operations, Mail: mx1001/2001 - Exim SMTP - Certificate expires Sep 22 2016 - https://phabricator.wikimedia.org/T144568#2603832 (AlexMonk-WMF) Part A: I guess it needs either procurement (`openssl s_client -starttls smtp -crlf -connect mx1001.wikimedia.org:25` shows the current cert is from GlobalSign) o...
[22:12:53] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:30:22] !log T144826: Restarting Cassandra on restbase2004-b.codfw.wmnet (scrub complete, re-joining cluster)
[22:30:23] T144826: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826
[22:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:33:17] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.044 second response time on port 9042
[22:37:55] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[23:28:11] (PS1) Dereckson: Configure Visual Editor namespaces on sv.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/309808 (https://phabricator.wikimedia.org/T144688)
[23:47:17] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-enabled/api-rewrites.incl]
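Note on the branch check discussed between 00:36 and 01:02: the shell pipeline plus Python loop pasted there could be folded into one script, roughly as sketched below. This is only a sketch under assumptions, not the script anyone in the channel actually wrote. The function names (`extract_change_id`, `row_count`, `unpushed_branches`) are made up for illustration, `gerrit` is assumed to be an ssh host alias as in the pasted command, and the stats line with a `rowCount` field is taken from the `grep stats` / `['rowCount']` usage in the chat.

```python
import json
import re
import subprocess

# A Change-Id trailer as it appears in a commit message body.
CHANGE_ID_RE = re.compile(r"^Change-Id:\s*(I[0-9a-f]+)\s*$", re.MULTILINE)


def extract_change_id(commit_message):
    """Return the last Change-Id trailer in a commit message, or None."""
    matches = CHANGE_ID_RE.findall(commit_message)
    return matches[-1] if matches else None


def row_count(query_output):
    """Read rowCount from the stats line of `gerrit query --format=json` output."""
    for line in query_output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") == "stats":
            return record["rowCount"]
    raise ValueError("no stats line found in query output")


def run(*args):
    """Run a command and return its standard output as text."""
    return subprocess.check_output(args, universal_newlines=True)


def unpushed_branches(gerrit_host="gerrit"):
    """Yield local branches whose tip commit's Change-Id Gerrit does not know."""
    # gerrit_host is an assumed ssh alias, mirroring `ssh gerrit` in the chat.
    refs = run("git", "for-each-ref", "--format=%(refname:short)", "refs/heads")
    for branch in refs.splitlines():
        message = run("git", "log", "-1", "--format=%B", branch)
        change_id = extract_change_id(message)
        if change_id is None:
            # No Change-Id trailer on the tip commit: report the branch as well.
            yield branch
            continue
        output = run("ssh", gerrit_host, "gerrit", "query",
                     "--format=json", change_id)
        if row_count(output) == 0:
            yield branch
```

Run from inside a repository, `for name in unpushed_branches(): print('git show --stat ' + name)` would print the same kind of lines as the loop pasted at 01:02:20. Pairing each branch with its own query result avoids the misalignment the `zip` over two files can hit when a branch tip has no Change-Id, and it only looks at the tip commit of each branch.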